[
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Miller reassigned SOLR-5952:
---------------------------------
Assignee: Mark Miller
> Recovery race/ error
> --------------------
>
> Key: SOLR-5952
> URL: https://issues.apache.org/jira/browse/SOLR-5952
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7
> Reporter: Jessica Cheng
> Assignee: Mark Miller
> Labels: leader, recovery, solrcloud, zookeeper
> Fix For: 4.8, 5.0
>
> Attachments: recovery-failure-host1-log.txt,
> recovery-failure-host2-log.txt
>
>
> We're seeing some shard recovery errors in our cluster when a zookeeper
> "error event" happened. In this particular case, we had two replicas. The
> events from the logs look roughly like this:
> 18:40:36 follower (host2) disconnected from zk
> 18:40:38 original leader started recovery (there was no log about why it
> needed recovery though) and failed because cluster state still says it's the
> leader
> 18:40:39 follower successfully connected to zk after some trouble
> 19:03:35 follower register core/replica
> 19:16:36 follower registration fails due to no leader (NoNode for
> /collections/test-1/leaders/shard2)
> Essentially, I think the problem is that the isLeader property on the cluster
> state is never cleaned up, so neither replica is able to recover/register
> in order to participate in leader election again.
> From the code, it looks like the only place where the isLeader property is
> cleared from the cluster state is ElectionContext.runLeaderProcess,
> which assumes that the replica with the next election seqId will notice the
> leader's node disappearing and run the leader process. This assumption fails
> in this scenario because the follower experienced the same zookeeper "error
> event" and never handled the event of the leader going away. (Mark, this is
> where I was saying in SOLR-3582 that maybe the watcher in
> LeaderElector.checkIfIamLeader does need to handle "Expired" by somehow
> realizing that the leader is gone and clearing the isLeader state at least,
> but it currently ignores all EventType.None events.)
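
The watcher behavior suggested above can be sketched as a small decision function. This is only an illustration under stated assumptions, not Solr's actual code: the class LeaderWatchSketch, the Action enum, and the onEvent method are hypothetical, and the EventType and KeeperState enums are simplified local stand-ins for the ZooKeeper types of the same names. The sketch keeps the existing rule of ignoring EventType.None events in general, but treats a None event carrying the Expired state as a signal to clear the stale isLeader flag and rejoin the election.

```java
public class LeaderWatchSketch {
    // Simplified local stand-ins for ZooKeeper's event type and keeper state
    // (assumption: only the values needed for this illustration are modeled).
    public enum EventType { None, NodeDeleted }
    public enum KeeperState { SyncConnected, Disconnected, Expired }

    // Hypothetical outcomes a leader watcher could choose between.
    public enum Action { IGNORE, CHECK_LEADER, CLEAR_LEADER_AND_REJOIN }

    // Decide how the watcher reacts to a single event.
    public static Action onEvent(EventType type, KeeperState state) {
        if (type == EventType.None) {
            // Connection-level event: the behavior described in the report
            // ignores all of these. The proposed change singles out session
            // expiry, where the leader's ephemeral node may be gone unnoticed.
            return state == KeeperState.Expired
                    ? Action.CLEAR_LEADER_AND_REJOIN
                    : Action.IGNORE;
        }
        // Any node-level event on the watched election node: re-check leadership.
        return Action.CHECK_LEADER;
    }

    public static void main(String[] args) {
        System.out.println(onEvent(EventType.None, KeeperState.Disconnected));
        System.out.println(onEvent(EventType.None, KeeperState.Expired));
        System.out.println(onEvent(EventType.NodeDeleted, KeeperState.SyncConnected));
    }
}
```

Whether clearing isLeader from the watcher is safe in a real cluster depends on how the cluster state is published, which this sketch does not model.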