[ https://issues.apache.org/jira/browse/SOLR-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar resolved SOLR-5373.
-----------------------------------------
    Resolution: Not A Problem

As noted by Mark, this is by design.

> Can't become leader due infinite recovery loop
> ----------------------------------------------
>
>                 Key: SOLR-5373
>                 URL: https://issues.apache.org/jira/browse/SOLR-5373
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.2
>         Environment: SolrCloud, 2 nodes, Fedora
>            Reporter: Javier Mendez
>            Assignee: Mark Miller
>            Priority: Minor
>              Labels: Recovery, SolrCloud
>             Fix For: 4.7
>
>         Attachments: SOLR-5373.patch, stack1, stack2, stack3, stack4, stack5, stack6, stack7
>
>
> We found an issue while performing stability tests on SolrCloud. Under certain circumstances, a node will get into an endless loop trying to recover. I've seen this happen in a two-node setup, by following these steps:
> 1) Node A started
> 2) Node B started
> 3) Node B stopped
> 4) Node B started, and immediately Node A stopped (normal graceful shutdown).
> At this point node B will throw connection refused messages while trying to sync to node A. For some reason (not always) this leads to a corrupt state where node B enters an infinite loop trying to recover from node A (it still thinks the cluster has two nodes). I think the leader election process started just fine, but since recovery is running async, at some point node B published its state as recovery failed, hence causing leader election to fail.
> Zookeeper /live_nodes has only one file.
> This shows in the logs:
> 0:57:18,960 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Running the leader process.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Checking if I should try and be the leader.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) My last published State was recovery_failed, I won't be the leader.
> 10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) There may be a better leader candidate than us - going back into recovery
> 10:57:19,118 INFO INFO [DefaultSolrCoreState] (main-EventThread) Running recovery - first canceling any ongoing recovery
> 10:57:19,118 WARN WARN [RecoveryStrategy] (main-EventThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while trying to recover. core=myCollection:org.apache.solr.common.SolrException: No registered leader was found, collection:myCollection slice:shard1
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
>
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - trying again... (0) core=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - interrupted. core=myCollection
> 10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - I give up. core=myCollection
> 10:57:19,869 INFO INFO [ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed
> 10:57:19,869 INFO INFO [ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
> 10:57:19,902 WARN WARN [RecoveryStrategy] (RecoveryThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
> 10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection
> 10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Starting recovery process.
> core=myCollection recoveringAfterStartup=false
> Solr Version: 4.2.1.2013.03.26.08.26.55
> Other references to the same issue:
> - https://support.lucidworks.com/entries/23553611-Solr-cluster-not-able-to-recover
> - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3c1371473296754-4070983.p...@n3.nabble.com%3E

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
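
For readers following the log excerpt in the report, the cycle it shows can be written out as a toy model. This is only a sketch built from the quoted log messages, not Solr's code: the names Replica, should_try_leader, attempt_recovery and simulate are hypothetical, and the two rules (a replica whose last published state is recovery_failed declines leadership; recovery fails when no leader is registered) are assumptions read off the logs.

```python
from dataclasses import dataclass

# Hypothetical state names taken from the log messages in the report.
ACTIVE = "active"
RECOVERY_FAILED = "recovery_failed"


@dataclass
class Replica:
    name: str
    last_published_state: str = ACTIVE


def should_try_leader(replica):
    """Assumed rule, per the log line 'My last published State was
    recovery_failed, I won't be the leader.'"""
    return replica.last_published_state != RECOVERY_FAILED


def attempt_recovery(replica, registered_leader):
    """Assumed rule: recovery needs a registered leader to sync from.
    Without one it fails and the replica publishes recovery_failed."""
    if registered_leader is None:
        replica.last_published_state = RECOVERY_FAILED
        return False
    replica.last_published_state = ACTIVE
    return True


def simulate(replica, registered_leader=None, max_rounds=5):
    """Alternate election and recovery for at most max_rounds rounds.

    Returns (trace, leader), where leader is None if no leader emerged.
    """
    trace = []
    for _ in range(max_rounds):
        if registered_leader is None:
            if should_try_leader(replica):
                trace.append("became_leader")
                return trace, replica.name
            trace.append("declined_leadership")
        if attempt_recovery(replica, registered_leader):
            trace.append("recovered")
            return trace, registered_leader
        trace.append("recovery_failed")
    return trace, None


if __name__ == "__main__":
    node_b = Replica("nodeB", last_published_state=RECOVERY_FAILED)
    trace, leader = simulate(node_b, registered_leader=None, max_rounds=3)
    print(trace, leader)
```

Under these assumptions, a lone replica that has already published recovery_failed never leaves the loop within the round limit, while the same replica with an active last state takes leadership on the first round.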