[jira] [Commented] (SOLR-13189) Need reliable example (Test) of how to use TestInjection.failReplicaRequests

Mark Miller (JIRA) Fri, 01 Feb 2019 11:54:35 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758625#comment-16758625
 ]


Mark Miller commented on SOLR-13189:
------------------------------------

{quote} * was bad in real life because if replica was having problems, it might 
not recognize/respond to LIR appropriately{quote}
It was fine from that perspective when Tim added LIR, which was the original 
communication through ZK. The problem was that the http request attempt was 
previously tied to each update, so if you had lots of fails, you would make 
tons of http calls and tons of requests to recover (we throttle recoveries now 
to prevent this type of thing). So that either needed to be removed, or made 
more efficient by not linking every http call to a document fail. I think it's 
been removed or else it's broken.
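
To illustrate what "not linking every http call to a document fail" could look 
like, here is a minimal sketch. It is not Solr's code: the class 
RecoveryRequestThrottle and its method are hypothetical names made up for this 
example, and the fixed minimum interval is an assumption. The idea is only 
that many failed documents for the same replica collapse into at most one 
recovery request per interval.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of Solr): allows at most one recovery
 * request per replica within a minimum interval, no matter how many
 * individual document failures are reported for that replica.
 */
public final class RecoveryRequestThrottle {
    private final long minIntervalNanos;
    // Timestamp of the last allowed request, keyed by replica name.
    private final ConcurrentMap<String, Long> lastRequest = new ConcurrentHashMap<>();

    public RecoveryRequestThrottle(long minIntervalNanos) {
        this.minIntervalNanos = minIntervalNanos;
    }

    /**
     * Called once per failed document. Returns true only when a recovery
     * request should actually be sent for the given replica.
     */
    public boolean shouldRequestRecovery(String replica, long nowNanos) {
        final boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent failures for the
        // same replica cannot both be allowed within one interval.
        lastRequest.compute(replica, (name, last) -> {
            if (last == null || nowNanos - last >= minIntervalNanos) {
                allowed[0] = true;
                return nowNanos;
            }
            return last;
        });
        return allowed[0];
    }
}
```

With this shape the caller still reports every failed document, but only the 
calls that return true would turn into an http request to the replica.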

bq. this is good in real life because it's less dependent on healthy 
network/http requests

We already had ZK based LIR on top of the http request attempt. I think the 
rewritten, improved LIR either removed the request attempt (rather than making 
it efficient) or broke it.

bq. this is bad in tests because there is an inherent and hard to predict delay 
before the replica even realizes it needs to go into recovery

It depends on the test. If you don't want flaky tests, all of them should obey 
the rules of the system when checking things as much as possible. More 
practically, the changed behavior mostly affects us injecting fails. That type 
of test should be isolated and have correct checking. For the rest of the 
tests, we probably don't expect fails, so failing if we have them seems fine: 
something likely needs to be fixed, or you are checking wrong.

bq. I haven't dug into your patch that deep, but so far it seems really 
hackish?

markmiller: Here is a hack to that test.

This is just to fix your test.

bq.  it makes the test wait (or timeout) until it is consistent

If you want to write a test like that, those are the rules, so that is what it 
does. Recovery can be re-triggered, and stuff can happen that will make 
reaching a consistent state take longer than you might think it should. So 
either your test is not creating the env you think it is, or it is, and this 
is how you properly test it.
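
As a rough sketch of the "wait (or timeout) until it is consistent" pattern: 
the helper below is not taken from the patch or from Solr's test framework. 
ConsistencyWait and waitUntilConsistent are hypothetical names, and the 
consistency check is left to the caller as a BooleanSupplier (for example, 
comparing doc counts across replicas). It only shows the shape: poll, and fail 
with a timeout instead of asserting immediately after the updates.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper (not part of Solr's test framework). */
public final class ConsistencyWait {
    private ConsistencyWait() {}

    /**
     * Polls isConsistent until it returns true, sleeping pollInterval
     * between attempts. Throws TimeoutException if the check has not
     * passed by the time the timeout elapses.
     */
    public static void waitUntilConsistent(BooleanSupplier isConsistent,
                                           Duration timeout,
                                           Duration pollInterval)
            throws InterruptedException, TimeoutException {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (isConsistent.getAsBoolean()) {
                return; // state is consistent, the test can now assert on it
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "replicas not consistent after " + timeout);
            }
            // Recovery may be re-triggered, so keep checking rather than
            // assuming a single recovery cycle is enough.
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

A test that injects fails would issue its updates, call something like 
waitUntilConsistent with a generous timeout, and only then check that every 
update that succeeded from the client's perspective is present.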

> Need reliable example (Test) of how to use TestInjection.failReplicaRequests
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-13189
>                 URL: https://issues.apache.org/jira/browse/SOLR-13189
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13189.patch, SOLR-13189.patch, SOLR-13189.patch
>
>
> We need a test that reliably demonstrates the usage of 
> {{TestInjection.failReplicaRequests}} and shows what steps a test needs to 
> take after issuing updates to reliably "pass" (finding all index updates that 
> succeeded from the client's perspective) even in the event of an (injected) 
> replica failure.
> As things stand now, it does not seem that any test using 
> {{TestInjection.failReplicaRequests}} passes reliably -- *and it's not clear 
> if this is due to poorly designed tests, or an indication of a bug in 
> distributed updates / LIR*


