[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704663#comment-14704663 ]

Adrian Fitzpatrick commented on SOLR-7021:
------------------------------------------

Also have seen this issue on Solr 4.10.3, on a 3-node cluster. The issue affected 
only one of the 3 collections; each of the 3 collections is configured with 5 
shards and 3 replicas. In the affected collection, the leader for each of the 5 
shards was on the same node (hadoopnode02), and all 5 leaders were showing as 
down. The other replicas for each shard were reporting that they were waiting 
for the leader (e.g. "I was asked to wait on state recovering for shard3 in 
the_collection_20150818161800 on hadoopnode01:8983_solr but I still do not see 
the requested state. I see state: recovering live:true leader from ZK: 
http://hadoopnode02:8983/solr/the_collection_20150818161800_shard3_replica2")
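
For anyone checking for the same symptom, the per-shard leader and replica 
states can be dumped with a short SolrJ program. A minimal sketch, assuming 
Solr 4.10.x SolrJ; the ZooKeeper connect string below is a placeholder and the 
collection name is the one from the log above:

{code}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class LeaderStateDump {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK connect string - substitute the real ensemble.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.connect();
        ClusterState state = server.getZkStateReader().getClusterState();

        // Print the leader and the state of every replica for each shard.
        for (Slice slice : state.getCollection("the_collection_20150818161800").getSlices()) {
            Replica leader = slice.getLeader();
            System.out.println(slice.getName() + " leader: "
                + (leader == null ? "none" : leader.getNodeName()));
            for (Replica replica : slice.getReplicas()) {
                System.out.println("  " + replica.getName() + " -> " + replica.getStr("state"));
            }
        }
        server.shutdown();
    }
}
{code}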

A work-around along the lines of the one suggested by Andrey worked: we shut 
down the whole cluster, then brought back up all nodes except the one that was 
reporting leader errors (hadoopnode02). This seemed to trigger a leader 
election, though without a quorum. We then brought up hadoopnode02; the 
election completed successfully and the cluster state returned to normal.
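
To confirm that the cluster really has settled after such a restart, a rough 
SolrJ sketch like the following can poll the cluster state until every shard 
has a leader and every replica on a live node reports "active" (again assuming 
Solr 4.10.x; the ZK connect string and collection name are placeholders):

{code}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class WaitForHealthyCollection {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // placeholder
        server.connect();
        try {
            while (true) {
                // The reader keeps its cached state up to date via ZooKeeper watches,
                // so re-reading it on each pass picks up state changes.
                ClusterState state = server.getZkStateReader().getClusterState();
                boolean healthy = true;
                for (Slice slice : state.getCollection("the_collection_20150818161800").getSlices()) {
                    if (slice.getLeader() == null) {
                        healthy = false; // no elected leader yet for this shard
                    }
                    for (Replica replica : slice.getReplicas()) {
                        boolean live = state.liveNodesContain(replica.getNodeName());
                        if (live && !"active".equals(replica.getStr("state"))) {
                            healthy = false; // a live replica is still down/recovering
                        }
                    }
                }
                if (healthy) {
                    System.out.println("every shard has a leader and all live replicas are active");
                    break;
                }
                Thread.sleep(5000); // poll every 5 seconds
            }
        } finally {
            server.shutdown();
        }
    }
}
{code}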




> Leader will not publish core as active without recovering first, but never 
> recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: a 1-core SolrCloud cluster across 3 nodes, each with its 
> own shard, and each shard with a single replica, hence each replica is itself a 
> leader. 
> For reasons we won't get into, we witnessed a shard go down in our cluster. 
> We restarted the cluster but our core/shards still did not come back up. 
> After inspecting the logs, we found this:
> {code}
> 2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
> http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
> :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
> as active without recovering first!
>       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our 
> core never returns to a functional state. 



