[ 
https://issues.apache.org/jira/browse/SOLR-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803573#comment-13803573
 ] 

Mark Miller commented on SOLR-5373:
-----------------------------------

I think this might be by design?

If you stop node a before b can recover from it, we can't know that b is up to
date. So the cluster won't serve - it wants you to restart the cluster and, if
you can, make sure node a is involved in the startup so that no data is missed.
If you decide to restart with just node b, at least you have to make that
choice explicitly.

If you wait for node b to recover and become leader before stopping node a, it 
should work fine. If you need to be able to survive this exact scenario, you 
need more replicas.
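
For the "more replicas" part, something along these lines would do it - just a
sketch, where the host, collection name ("myCollection"), and config set
("myConf") are placeholders rather than anything taken from this report. It
creates the collection with three replicas per shard through the Collections
API, so losing one node still leaves another copy that can take over as leader:

    # Sketch only: adjust host, collection name, and config set to your setup.
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "action": "CREATE",
        "name": "myCollection",
        "collection.configName": "myConf",
        "numShards": 1,
        # Three replicas per shard: one node can go down and the shard
        # still has a second copy besides the one that must become leader.
        "replicationFactor": 3,
    })
    url = "http://localhost:8983/solr/admin/collections?" + params
    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode("utf-8"))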

> Can't become leader due to infinite recovery loop
> -------------------------------------------------
>
>                 Key: SOLR-5373
>                 URL: https://issues.apache.org/jira/browse/SOLR-5373
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.2
>         Environment: SolrCloud, 2 nodes, Fedora
>            Reporter: Javier Mendez
>            Assignee: Mark Miller
>            Priority: Minor
>              Labels: Recovery, SolrCloud
>             Fix For: 4.6, 5.0
>
>         Attachments: stack1, stack2, stack3, stack4, stack5, stack6, stack7
>
>
> We found an issue while performing stability tests on SolrCloud. Under
> certain circumstances, a node will get into an endless loop trying to recover.
> I've seen this happen in a two-node setup by following these steps:
> 1) Node A started
> 2) Node B started
> 3) Node B stopped
> 4) Node B started, and immediately Node A stopped (normal graceful shutdown). 
> At this point node B will throw connection refused messages while trying to 
> sync to node A. Sometimes (not always) this leads to a corrupt state where
> node B enters an infinite loop trying to recover from node A (it still thinks
> the cluster has two nodes). I think the leader election process started just
> fine, but since recovery runs asynchronously, at some point node B published
> its state as recovery_failed, which causes leader election to fail. ZooKeeper
> /live_nodes has only one entry (a quick way to check this is sketched after
> the references below).
> This shows in the logs:
>     10:57:18,960 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) 
> Running the leader process.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) 
> Checking if I should try and be the leader.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) 
> My last published State was recovery_failed, I won't be the leader.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) 
> There may be a better leader candidate than us - going back into recovery
>     10:57:19,118 INFO INFO  [DefaultSolrCoreState] (main-EventThread) Running 
> recovery - first canceling any ongoing recovery
>     10:57:19,118 WARN WARN  [RecoveryStrategy] (main-EventThread) Stopping 
> recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while 
> trying to recover. core=myCollection:org.apache.solr.common.SolrException: No 
> registered leader was found, collection:myCollection slice:shard1
>             at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
>             at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
>             at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
>             at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
>     
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery 
> failed - trying again... (0) core=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery 
> failed - interrupted. core=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery 
> failed - I give up. core=myCollection
>     10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) publishing 
> core=myCollection state=recovery_failed
>     10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) numShards not 
> found on descriptor - reading it from system property
>     10:57:19,902 WARN WARN  [RecoveryStrategy] (RecoveryThread) Stopping 
> recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
>     10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Finished 
> recovery process. core=myCollection
>     10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Starting 
> recovery process.  core=myCollection recoveringAfterStartup=false
> Solr Version: 4.2.1.2013.03.26.08.26.55
> Other references to the same issue:
>  - 
> https://support.lucidworks.com/entries/23553611-Solr-cluster-not-able-to-recover
>  
>  - 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%[email protected]%3E
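
To double-check the /live_nodes observation from the description above, you can
list that znode directly in ZooKeeper. A minimal sketch with the kazoo Python
client follows; the ZooKeeper address and any chroot (e.g. /solr) are
assumptions about the local setup, not details from this report:

    # Sketch: adjust hosts (and chroot, e.g. "localhost:2181/solr") to your setup.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()
    try:
        # SolrCloud keeps one ephemeral znode per live node under /live_nodes;
        # in the scenario above only node B should be listed here.
        print("live nodes:", zk.get_children("/live_nodes"))
    finally:
        zk.stop()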



