[ 
https://issues.apache.org/jira/browse/SOLR-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-10914:
-----------------------------------------
    Attachment: SOLR-10914.patch

I added a TestPrepRecovery to this patch. It has two tests:
# Test that prep recovery eventually succeeds when the leader doesn't respond 
at all. This is an explicit test to protect against regressions of SOLR-9716.
# Test that prep recovery succeeds within 90s (default timeout of waitForState) 
when the leader is unloaded. The test doesn't actually wait for 90s unless 
there is a regression, but it can wait up to 23 seconds for the second recovery 
attempt, which will succeed because the fault injection is done only once. Also, 
the fault injection is only done 30% of the time.
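
For readers unfamiliar with the pattern, a one-shot, probabilistic fault can be sketched roughly as below. This is only an illustration: the class OneShotFault and its methods are hypothetical names, not the code in the patch or Solr's test-injection API, and it assumes "30% of the time" means the fault is enabled for about 30% of test runs.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical one-shot fault injector; not the code from the patch. */
public class OneShotFault {
    // Decided once per run: is the fault enabled at all?
    private final boolean enabledForThisRun;
    // Flips to true the first time the fault is actually injected.
    private final AtomicBoolean fired = new AtomicBoolean(false);

    public OneShotFault(boolean enabledForThisRun) {
        this.enabledForThisRun = enabledForThisRun;
    }

    /** Enables the fault for roughly the given percentage of runs (assumed reading of "30%"). */
    public static OneShotFault withChance(Random random, int percent) {
        return new OneShotFault(random.nextInt(100) < percent);
    }

    /** Returns true at most once, so a retry after the injected failure can succeed. */
    public boolean shouldInject() {
        return enabledForThisRun && fired.compareAndSet(false, true);
    }
}
```

Because shouldInject() can return true only once, the second recovery attempt never sees the fault, which is why the test only has to wait for one retry.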

This is ready.

> RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader 
> is unloaded
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-10914
>                 URL: https://issues.apache.org/jira/browse/SOLR-10914
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.4, 6.5, 6.6
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: master (7.0)
>
>         Attachments: SOLR-10914.patch, SOLR-10914.patch
>
>
> tl;dr: a recovering replica is stuck for 5 minutes in the prep recovery 
> request if the leader core is unloaded before the prep recovery request is 
> made.
> SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier 
> it had no connection/read timeout at all), but the fix has caused another 
> problem. Say:
> # A replica starts up (or is newly created) and goes into recovery
> # Replica finds that leader=X
> # The core X is unloaded but the node that used to host X is still running 
> and taking requests
> # Replica calls sendPrepRecoveryCmd to X
> At this point, the node that hosted X receives the prep recovery command, 
> finds that the core X does not exist, and keeps checking again in a 
> sleep-loop until a timeout happens. I am not sure why the prep recovery core 
> admin command needs to continue waiting if a local core does not exist. The 
> default timeout here is usually longer than 10 seconds.
> On the recovering replica's side, the prep recovery has a connection/read 
> timeout of only 10s, so the request always times out and is retried for up 
> to 5 minutes. Only then does the recovery attempt fail, after which it may 
> be restarted with the right leader URL.
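
The timing interaction described above can be modelled with a small simulation. This is a sketch under stated assumptions, not Solr code: PrepRecoverySketch, simulate, and leaderHoldTime are hypothetical names, the 10s read timeout and 5 minute retry window come from the description, and the 90s leader-side wait is an assumed value used only for illustration.

```java
import java.time.Duration;

/** Hypothetical model of the replica/leader timing interaction; not Solr code. */
public class PrepRecoverySketch {

    /** Requests sent, simulated time spent, and whether any response arrived. */
    public static final class Outcome {
        public final int attempts;
        public final Duration elapsed;
        public final boolean gotResponse;

        Outcome(int attempts, Duration elapsed, boolean gotResponse) {
            this.attempts = attempts;
            this.elapsed = elapsed;
            this.gotResponse = gotResponse;
        }
    }

    /**
     * Simulates the replica retrying the prep recovery request.
     * leaderHoldTime is how long the leader's node holds each request
     * before answering (with success or with an error).
     */
    public static Outcome simulate(Duration readTimeout, Duration retryBudget,
                                   Duration leaderHoldTime) {
        Duration elapsed = Duration.ZERO;
        int attempts = 0;
        while (elapsed.compareTo(retryBudget) < 0) {
            attempts++;
            if (leaderHoldTime.compareTo(readTimeout) < 0) {
                // The leader answered before the read timeout expired.
                return new Outcome(attempts, elapsed.plus(leaderHoldTime), true);
            }
            // Read timeout hit: the replica gives up on this request and retries.
            elapsed = elapsed.plus(readTimeout);
        }
        return new Outcome(attempts, elapsed, false);
    }

    public static void main(String[] args) {
        Duration readTimeout = Duration.ofSeconds(10); // from the description
        Duration retryBudget = Duration.ofMinutes(5);  // from the description

        // Assumed 90s sleep-loop on the leader's node: every request times out.
        Outcome stuck = simulate(readTimeout, retryBudget, Duration.ofSeconds(90));
        // Hypothetical fail-fast leader: answers immediately when the core is missing.
        Outcome failFast = simulate(readTimeout, retryBudget, Duration.ZERO);

        System.out.println("sleep-loop: attempts=" + stuck.attempts
                + " seconds=" + stuck.elapsed.getSeconds()
                + " gotResponse=" + stuck.gotResponse);
        System.out.println("fail-fast:  attempts=" + failFast.attempts
                + " seconds=" + failFast.elapsed.getSeconds()
                + " gotResponse=" + failFast.gotResponse);
    }
}
```

Under these assumptions the sleep-loop case makes 30 attempts over 300 simulated seconds without ever receiving a response, while a leader that answers immediately lets the replica learn of the failure on the first request.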


