[
https://issues.apache.org/jira/browse/SOLR-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Miller updated SOLR-5373:
------------------------------
Attachment: SOLR-5373.patch
Here is an example unit test that lets you play around with this.
If you remove the line that waits for node B to recover when it comes back up,
the test will often fail because node A goes down before node B can become the
leader. With that line in place, the test should pass.
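The actual test is in the attached SOLR-5373.patch. As a generic illustration of the "wait for node B to recover" step, a polling wait can be sketched like this — a standalone snippet, not Solr's test API (the condition below is a timer stand-in for "node B is active again"):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class WaitFor {
    // Poll a condition until it holds or the timeout elapses.
    static boolean waitFor(Supplier<Boolean> condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) return true;
            TimeUnit.MILLISECONDS.sleep(100);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in for "node B has published state=active": true after ~300 ms.
        boolean recovered = waitFor(
                () -> System.currentTimeMillis() - start > 300, 5000);
        System.out.println("recovered=" + recovered);
    }
}
```

Skipping such a wait is what reintroduces the race: node A can be stopped before node B has rejoined.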
> Can't become leader due to infinite recovery loop
> -------------------------------------------------
>
> Key: SOLR-5373
> URL: https://issues.apache.org/jira/browse/SOLR-5373
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.2
> Environment: SolrCloud, 2 nodes, Fedora
> Reporter: Javier Mendez
> Assignee: Mark Miller
> Priority: Minor
> Labels: Recovery, SolrCloud
> Fix For: 4.6, 5.0
>
> Attachments: SOLR-5373.patch, stack1, stack2, stack3, stack4, stack5,
> stack6, stack7
>
>
> We found an issue while performing stability tests on SolrCloud. Under
> certain circumstances, a node will get in an endless loop trying to recover.
> I've seen this happen in a two node setup, by following these steps:
> 1) Node A started
> 2) Node B started
> 3) Node B stopped
> 4) Node B started, and immediately Node A stopped (normal graceful shutdown).
> At this point node B will throw connection-refused messages while trying to
> sync with node A. Sometimes (not always) this leads to a corrupt state
> where node B enters an infinite loop trying to recover from node A (it still
> thinks the cluster has two nodes). I think the leader election process
> started just fine, but since recovery runs asynchronously, at some point node B
> published its state as recovery_failed, causing leader election to fail.
> Zookeeper /live_nodes has only one entry.
> This shows on the logs:
> 10:57:18,960 INFO INFO [ShardLeaderElectionContext] (main-EventThread)
> Running the leader process.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread)
> Checking if I should try and be the leader.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread)
> My last published State was recovery_failed, I won't be the leader.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread)
> There may be a better leader candidate than us - going back into recovery
> 10:57:19,118 INFO INFO [DefaultSolrCoreState] (main-EventThread) Running
> recovery - first canceling any ongoing recovery
> 10:57:19,118 WARN WARN [RecoveryStrategy] (main-EventThread) Stopping
> recovery for zkNodeName=10.50.100.30:8998_solr_myCollection core=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while
> trying to recover. core=myCollection:org.apache.solr.common.SolrException: No
> registered leader was found, collection:myCollection slice:shard1
> at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
> at
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
> at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
> at
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
>
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery
> failed - trying again... (0) core=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery
> failed - interrupted. core=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery
> failed - I give up. core=myCollection
> 10:57:19,869 INFO INFO [ZkController] (RecoveryThread) publishing
> core=myCollection state=recovery_failed
> 10:57:19,869 INFO INFO [ZkController] (RecoveryThread) numShards not
> found on descriptor - reading it from system property
> 10:57:19,902 WARN WARN [RecoveryStrategy] (RecoveryThread) Stopping
> recovery for zkNodeName=10.50.100.30:8998_solr_myCollection core=myCollection
> 10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Finished
> recovery process. core=myCollection
> 10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Starting
> recovery process. core=myCollection recoveringAfterStartup=false
> Solr Version: 4.2.1.2013.03.26.08.26.55
> Other references to the same issue:
> -
> https://support.lucidworks.com/entries/23553611-Solr-cluster-not-able-to-recover
>
> -
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%[email protected]%3E
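The feedback loop in the quoted report (last published state recovery_failed blocks leadership; recovery in turn needs a registered leader to sync from) can be sketched as a standalone simulation. This is not Solr code — the names are made up, and the real logic lives in ShardLeaderElectionContext and RecoveryStrategy:

```java
import java.util.Optional;

// Simplified replica states; Solr has more.
enum ReplicaState { ACTIVE, RECOVERING, RECOVERY_FAILED }

public class LeaderElectionLoop {
    // Node B's last published state after the failed sync with node A.
    static ReplicaState lastPublished = ReplicaState.RECOVERY_FAILED;
    // Node A is gone, so no leader is registered for the shard.
    static Optional<String> registeredLeader = Optional.empty();

    // Mirrors "My last published State was recovery_failed, I won't be the leader."
    static boolean shouldBeLeader() {
        return lastPublished != ReplicaState.RECOVERY_FAILED;
    }

    // Mirrors recovery: without a registered leader to sync from, it fails
    // and publishes recovery_failed again.
    static boolean tryRecover() {
        if (!registeredLeader.isPresent()) {
            lastPublished = ReplicaState.RECOVERY_FAILED;
            return false;
        }
        lastPublished = ReplicaState.ACTIVE;
        return true;
    }

    public static void main(String[] args) {
        // Node B cycles: decline leadership -> recovery fails -> decline again.
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (shouldBeLeader()) break;
            System.out.println("attempt " + attempt
                    + ": won't be leader (state=" + lastPublished
                    + "), retrying recovery");
            tryRecover();
        }
        System.out.println("stuck: leader=" + registeredLeader.orElse("none")
                + ", state=" + lastPublished);
    }
}
```

Each pass through the loop leaves the state unchanged, which is why the node can never exit it on its own — breaking the cycle requires leader election to accept a candidate despite the recovery_failed state, or recovery to succeed without a leader.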
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]