[jira] [Created] (SOLR-5373) Can't become leader due infinite recovery loop

Javier Mendez (JIRA) Mon, 21 Oct 2013 10:05:22 -0700

Javier Mendez created SOLR-5373:
-----------------------------------

             Summary: Can't become leader due infinite recovery loop
                 Key: SOLR-5373
                 URL: https://issues.apache.org/jira/browse/SOLR-5373
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.2
         Environment: SolrCloud, 2 nodes, Fedora
            Reporter: Javier Mendez
            Priority: Minor



We found an issue while performing stability tests on SolrCloud. Under certain 
circumstances, a node gets stuck in an endless loop trying to recover. I've seen 
this happen in a two-node setup by following these steps:

1) Node A started
2) Node B started
3) Node B stopped
4) Node B started, and immediately Node A stopped (normal graceful shutdown). 

At this point node B will throw connection refused messages while trying to 
sync to node A. Sometimes (not always) this leads to a corrupt state where 
node B enters an infinite loop trying to recover from node A (it still thinks 
the cluster has two nodes). I think the leader election process started just 
fine, but since recovery runs asynchronously, at some point node B published 
its state as recovery_failed, which caused leader election to fail.
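
To make the suspected cycle concrete, here is a toy model in Python. It is 
not Solr code: the class, the function names and the round limit are all 
hypothetical, and it only encodes the hypothesis above (election refuses a 
node whose last published state is recovery_failed, and recovery fails while 
no leader is registered).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for node B; hypothetical, not a Solr class."""
    last_published_state: str = "recovering"
    is_leader: bool = False
    log: list = field(default_factory=list)


def try_become_leader(node):
    # Assumed rule, taken from the log line
    # "My last published State was recovery_failed, I won't be the leader."
    if node.last_published_state == "recovery_failed":
        node.log.append("election: refused, last state was recovery_failed")
        return False
    node.is_leader = True
    node.log.append("election: became leader")
    return True


def try_recover(node, leader_registered):
    # Assumed rule: recovery needs a registered leader to sync from.
    if not leader_registered:
        node.last_published_state = "recovery_failed"
        node.log.append("recovery: no registered leader, recovery_failed")
        return False
    node.last_published_state = "active"
    node.log.append("recovery: succeeded")
    return True


def simulate(recovery_finishes_first, max_rounds=5):
    """Run election/recovery rounds; max_rounds only bounds the toy loop."""
    node = Node()
    if recovery_finishes_first:
        # The async recovery fails before the election gets to run.
        try_recover(node, leader_registered=False)
    rounds = 0
    while rounds < max_rounds and not try_become_leader(node):
        rounds += 1
        # Node A is gone, so a leader exists only if this node became one.
        try_recover(node, leader_registered=node.is_leader)
    return node, rounds
```

In this model, simulate(recovery_finishes_first=False) elects the node right 
away, while simulate(recovery_finishes_first=True) never does and stops only 
because of the artificial round limit, which matches the "not always" nature 
of the problem.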

ZooKeeper's /live_nodes has only one entry.

This shows up in the logs:
    0:57:18,960 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) Running the leader process.
    10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) Checking if I should try and be the leader.
    10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) My last published State was recovery_failed, I won't be the leader.
    10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) There may be a better leader candidate than us - going back into recovery
    10:57:19,118 INFO INFO  [DefaultSolrCoreState] (main-EventThread) Running recovery - first canceling any ongoing recovery
    10:57:19,118 WARN WARN  [RecoveryStrategy] (main-EventThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
    10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while trying to recover. core=myCollection:org.apache.solr.common.SolrException: No registered leader was found, collection:myCollection slice:shard1
            at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
            at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
            at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
            at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)

    10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - trying again... (0) core=myCollection
    10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - interrupted. core=myCollection
    10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - I give up. core=myCollection
    10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed
    10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
    10:57:19,902 WARN WARN  [RecoveryStrategy] (RecoveryThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
    10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection
    10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Starting recovery process.  core=myCollection recoveringAfterStartup=false

Solr Version: 4.2.1.2013.03.26.08.26.55

Other references to the same issue:

 - https://support.lucidworks.com/entries/23553611-Solr-cluster-not-able-to-recover
 - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%[email protected]%3E




--
This message was sent by Atlassian JIRA
(v6.1#6144)
