[
https://issues.apache.org/jira/browse/SOLR-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758564#comment-16758564
]
Hoss Man commented on SOLR-13189:
---------------------------------
{quote}In older versions these tests might have worked because before the
request returns to the client, the leader would have called to the replica and
told it to go into recovery. I believe we no longer make these calls (for good
reason, http calls tied to updates was no good). So a replica will only enter
recovery when it realizes it should via ZooKeeper communication.
{quote}
Ok ... so to reiterate and make sure I'm following everything:
* OLD LIR:
** LIR was pushed to the replica via HTTP immediately after the replica returned
a non-200 status
** was bad in real life because if the replica was having problems, it might not
recognize/respond to the LIR request appropriately
** was good in tests because it meant that immediately after doing an index
update, you could {{waitForRecoveriesToFinish}} and the replica would already be
in recovery
* CURRENT LIR:
** LIR status is managed via flags in ZK (this is the "terms" concept correct?)
** replicas monitor ZK to see if/when they need to go into LIR
** this is good in real life because it's less dependent on healthy
network/http requests
** this is bad in tests because there is an inherent and hard-to-predict delay
before the replica even realizes it needs to go into recovery
*** ie: {{waitForRecoveriesToFinish}} now seems completely useless? (rough
sketch of the gap below)
does that cover it?
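To make sure I'm picturing the gap correctly: polling {{Replica.State}} (which is
essentially all {{waitForRecoveriesToFinish}} keys off of) can return a false "all
clear", because a replica that is about to be told to recover may still report
ACTIVE until it notices the term bump in ZK. A minimal sketch of that kind of
check (not the actual {{waitForRecoveriesToFinish}} code, and the helper name is
made up):
{code:java}
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.ZkStateReader;

public class ActiveStateCheckSketch {
  /**
   * Returns true if every replica currently *reports* ACTIVE - which says nothing
   * about a recovery that has been decided (via terms) but not yet started.
   */
  static boolean allReplicasReportActive(ZkStateReader reader, String collection) {
    DocCollection coll = reader.getClusterState().getCollection(collection);
    for (Replica replica : coll.getReplicas()) {
      if (replica.getState() != Replica.State.ACTIVE) {
        return false;
      }
    }
    // may be a false "all clear": a replica that hasn't yet noticed the term bump
    // still reports ACTIVE even though it is (or should be) headed into recovery
    return true;
  }
}
{code}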
{quote}The system will be eventually consistent, but there is no promise it
will be consistent even when all replicas are active. You must be willing to
wait a short time for consistency and this test does not.
{quote}
Right ... i understand that ... the question at the heart of this jira is what
a test can/should do to know "the system should now be consistent enough for me
to make the assertions I want to make" (and how do we make that as easy as
possible for tests to do).
I haven't dug into your patch that deeply, but so far it seems really hackish?
... sleep looping until all the replicas are live and the first 1000 docs from a
{{*:*}} query to each replica match each other?
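(To make sure I'm reading it right, my understanding of that check is roughly the
sketch below -- this is not the patch code and the helper name is made up -- query
each replica core directly with {{distrib=false}} and compare the first 1000 ids:)
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.cloud.Replica;

public class ConsistencyCheckSketch {
  /** Hypothetical helper: do the first 1000 docs (sorted by id) of every replica agree? */
  static boolean first1000DocsMatch(List<Replica> replicas)
      throws SolrServerException, IOException {
    List<Object> expectedIds = null;
    for (Replica replica : replicas) {
      try (HttpSolrClient client = new HttpSolrClient.Builder(replica.getCoreUrl()).build()) {
        SolrQuery q = new SolrQuery("*:*").setRows(1000).setSort("id", SolrQuery.ORDER.asc);
        q.set("distrib", "false"); // ask only this core, not the whole collection
        List<Object> ids = new ArrayList<>();
        for (SolrDocument doc : client.query(q).getResults()) {
          ids.add(doc.getFieldValue("id"));
        }
        if (expectedIds == null) {
          expectedIds = ids;       // first replica sets the expectation
        } else if (!expectedIds.equals(ids)) {
          return false;            // replicas disagree - not (yet) consistent
        }
      }
    }
    return true;
  }
}
{code}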
If nothing else this creates a (slow) chicken-and-egg diagnosis problem in
tests -- did {{waitForConsistency}} eventually time out because the recovery is
broken, or because the code I'm writing a test for (example: distributed
atomic updates) is broken?
I'm not saying the {{checkConsistency}} logic is bad -- if anything it seems
like something that might be good to have in the tear down of every test -- but
I'm concerned that just trying to do a "wait for" on it doesn't really get to
the heart of the problem of tests being able to know when the cluster
*_should_* be consistent -- it makes the test wait (or timeout) until it *_is_*
consistent.
----
If recovery is driven by these flags in ZK, then why couldn't we rewrite
{{waitForRecoveriesToFinish}} to check those flags first (in addition to the
{{Replica.State}}) to know if recovery is pending (or in progress)?
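As a strawman (and assuming the terms live in a znode like
{{/collections/<collection>/terms/<shardId>}} holding a JSON map of
coreNodeName -> term, which is my understanding of the current layout -- correct
me if that's wrong), the extra check could look something like this sketch (the
helper name is made up):
{code:java}
import java.util.Map;
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.solr.common.util.Utils;
import org.apache.zookeeper.KeeperException;

public class TermsCheckSketch {
  /**
   * Hypothetical helper: is any replica's term behind the max term for the shard?
   * If so, a recovery is pending (or should be), even if every Replica.State is
   * still ACTIVE.
   */
  @SuppressWarnings("unchecked")
  static boolean recoveryPending(SolrZkClient zkClient, String collection, String shardId)
      throws KeeperException, InterruptedException {
    String path = "/collections/" + collection + "/terms/" + shardId;
    byte[] data = zkClient.getData(path, null, null, true);
    Map<String, Long> terms = (Map<String, Long>) Utils.fromJSON(data);
    long max = terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    // any replica whose term is behind the shard's max term still has to recover
    return terms.values().stream().anyMatch(t -> t < max);
  }
}
{code}
Then {{waitForRecoveriesToFinish}} could poll until both "no recovery pending per
the terms" and "every replica ACTIVE" hold, instead of only the latter.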
> Need reliable example (Test) of how to use TestInjection.failReplicaRequests
> ----------------------------------------------------------------------------
>
> Key: SOLR-13189
> URL: https://issues.apache.org/jira/browse/SOLR-13189
> Project: Solr
> Issue Type: Sub-task
>   Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-13189.patch, SOLR-13189.patch, SOLR-13189.patch
>
>
> We need a test that reliably demonstrates the usage of
> {{TestInjection.failReplicaRequests}} and shows what steps a test needs to
> take after issuing updates to reliably "pass" (finding all index updates that
> succeeded from the clients perspective) even in the event of an (injected)
> replica failure.
> As things stand now, it does not seem that any test using
> {{TestInjection.failReplicaRequests}} passes reliably -- *and it's not clear
> if this is due to poorly designed tests, or an indication of a bug in
> distributed updates / LIR*