[
https://issues.apache.org/jira/browse/SOLR-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758625#comment-16758625
]
Mark Miller commented on SOLR-13189:
------------------------------------
{quote} * was bad in real life because if replica was having problems, it might
not recognize/respond to LIR appropriately{quote}
It was fine from that perspective when Tim added LIR - the original
communication through ZK. The problem was that it was tied to each update
before, so if you had lots of failures, you would make tons of HTTP calls and
tons of requests to recover (we throttle recoveries now to prevent this type of
thing). So that either needed to be removed, or made more efficient by not
linking every HTTP call to a document failure. I think it's been removed or
else it's broken.
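To make the "more efficient" option concrete, here is a minimal sketch of
decoupling recovery requests from individual document failures. It is not
Solr's LIR code: the class RecoveryRequestThrottle, its method name, and the
injectable clock are all hypothetical, made up for illustration. The idea is
only that every failed update may ask, but at most one request per replica
goes out per interval.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch (not Solr code): coalesce recovery requests per replica. */
final class RecoveryRequestThrottle {
    private final long minIntervalNanos;
    private final LongSupplier clock; // injectable so the sketch can be tested without sleeping
    private final Map<String, Long> lastRequestNanos = new ConcurrentHashMap<>();

    RecoveryRequestThrottle(long minIntervalNanos, LongSupplier clock) {
        this.minIntervalNanos = minIntervalNanos;
        this.clock = clock;
    }

    /**
     * Called once per failed update. Returns true only when a recovery
     * request should actually be sent for this replica.
     */
    boolean shouldRequestRecovery(String replicaName) {
        long now = clock.getAsLong();
        boolean[] send = {false};
        // compute() runs atomically per key, so concurrent failures for the
        // same replica cannot both decide to send.
        lastRequestNanos.compute(replicaName, (name, last) -> {
            if (last == null || now - last >= minIntervalNanos) {
                send[0] = true;
                return now; // remember when we last asked this replica to recover
            }
            return last; // too soon: keep the old timestamp, send nothing
        });
        return send[0];
    }
}
```

With something like this, a burst of failed documents against one replica
results in a single request to recover rather than one per document.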
bq. this is good in real life because it's less dependent on healthy
network/http requests
We already had ZK based LIR on top of the HTTP request attempt. I think the
rewritten, improved LIR either removed the request attempt (rather than making
it efficient) or broke it.
bq. this is bad in tests because there is an inherent and hard to predict delay
before the replica even realizes it needs to go into recovery
It depends on the test. If you don't want flaky tests, all of them should obey
the rules of the system when checking things as much as possible. More
practically, the changed behavior mostly affects us injecting failures. That
type of test should be isolated and have correct checking. For the rest of the
tests, we probably don't expect failures, so failing if we have them seems
fine: something likely needs to be fixed or you are checking wrong.
bq. I haven't dug into your patch that deep, but so far it seems really
hackish?
markmiller: Here is a hack to that test.
This is just to fix your test.
bq. it makes the test wait (or timeout) until it is consistent
If you want to write a test like that, those are the rules, so that is what it
does. Recovery can be re-triggered, and other things can happen that make
reaching a consistent state take longer than you might think it should. So
either your test is not creating the env you think it is, or it is, and this
is how you properly test it.
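As a rough illustration of "wait (or timeout) until it is consistent", here is
a minimal polling sketch. The class ConsistencyWait and its waitUntil method
are hypothetical helpers, not a Solr API, and the condition passed in stands
for whatever the test considers consistent (for example, all replicas active
and reporting the same document count).

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper (not a Solr API): poll a condition until it holds or time runs out. */
final class ConsistencyWait {
    private ConsistencyWait() {}

    /**
     * Re-checks the condition every pollInterval. Returns true as soon as it
     * holds, or false if the timeout elapses or the thread is interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // consistent: the test may now assert on the state
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the caller should fail the test
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A test would then assert that waitUntil returned true before comparing
results, instead of checking once immediately after the updates. The timeout
has to allow for recovery being re-triggered, so it should be generous.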
> Need reliable example (Test) of how to use TestInjection.failReplicaRequests
> ----------------------------------------------------------------------------
>
> Key: SOLR-13189
> URL: https://issues.apache.org/jira/browse/SOLR-13189
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-13189.patch, SOLR-13189.patch, SOLR-13189.patch
>
>
> We need a test that reliably demonstrates the usage of
> {{TestInjection.failReplicaRequests}} and shows what steps a test needs to
> take after issuing updates to reliably "pass" (finding all index updates that
> succeeded from the client's perspective) even in the event of an (injected)
> replica failure.
> As things stand now, it does not seem that any test using
> {{TestInjection.failReplicaRequests}} passes reliably -- *and it's not clear
> if this is due to poorly designed tests, or an indication of a bug in
> distributed updates / LIR*
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)