Jessica Cheng created SOLR-5952:
-----------------------------------

             Summary: Recovery race/ error
                 Key: SOLR-5952
                 URL: https://issues.apache.org/jira/browse/SOLR-5952
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.7
            Reporter: Jessica Cheng


We're seeing shard recovery errors in our cluster after a ZooKeeper "error 
event". In this particular case, we had two replicas. The events in the logs 
look roughly like this:

18:40:36 follower (host2) disconnected from zk
18:40:38 original leader started recovery (there was no log about why it needed 
recovery though) and failed because cluster state still says it's the leader
18:40:39 follower successfully connected to zk after some trouble
19:03:35 follower registers core/replica
19:16:36 follower registration fails due to no leader (NoNode for 
/collections/test-1/leaders/shard2)

Essentially, I think the problem is that the isLeader property in the cluster 
state is never cleaned up, so neither replica is able to recover/register in 
order to participate in leader election again.

From the code, it looks like the only place the isLeader property is cleared 
from the cluster state is ElectionContext.runLeaderProcess, which assumes that 
the replica with the next election seqId will notice the leader's node 
disappearing and run the leader process. This assumption fails in this 
scenario because the follower experienced the same ZooKeeper "error event" and 
never handled the event of the leader going away. (Mark, this is what I was 
getting at in SOLR-3582: maybe the watcher in LeaderElector.checkIfIamLeader 
does need to handle "Expired" by somehow realizing that the leader is gone and 
at least clearing the isLeader state, but it currently ignores all 
EventType.None events.)
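
To make the suggestion concrete, here is a rough, self-contained sketch of the 
two watcher behaviours. It is not Solr code: the class LeaderWatchSketch, its 
fields, and the nested EventType/KeeperState enums are hypothetical stand-ins 
(the enums only mimic the names of ZooKeeper's Watcher.Event types so the 
sketch compiles without the ZooKeeper jar), and re-checking the leader node on 
"Expired" is just one possible way of "realizing that the leader is gone".

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a follower's leader watch; all names here are hypothetical.
class LeaderWatchSketch {
    // Stand-ins that mimic ZooKeeper's Watcher.Event.EventType / KeeperState.
    enum EventType { None, NodeDeleted }
    enum KeeperState { SyncConnected, Disconnected, Expired }

    // Toy cluster state: replica name -> isLeader flag.
    final Map<String, Boolean> isLeader = new HashMap<>();
    // Whether the leader's ephemeral election node still exists (assumed input).
    boolean leaderNodePresent = true;
    // Whether this follower went on to run the leader process.
    boolean electionTriggered = false;
    // true = proposed behaviour, false = behaviour described in the report.
    final boolean handleExpired;

    LeaderWatchSketch(boolean handleExpired) {
        this.handleExpired = handleExpired;
        isLeader.put("host1", true);   // original leader
        isLeader.put("host2", false);  // follower
    }

    // Invoked when the follower's watch on the leader's node fires.
    void process(EventType type, KeeperState state) {
        if (type == EventType.None) {
            // Connection-state events: ignored entirely in the reported behaviour.
            if (!handleExpired || state != KeeperState.Expired) {
                return;
            }
            // Proposed: the watch died with the session, so re-check the node.
            if (leaderNodePresent) {
                return; // leader is still there, nothing to clean up
            }
        }
        // Leader is gone: clear the stale flag and take part in the election.
        isLeader.replaceAll((replica, flag) -> false);
        electionTriggered = true;
    }

    boolean hasStaleLeader() {
        return isLeader.containsValue(true);
    }

    public static void main(String[] args) {
        for (boolean handle : new boolean[] {false, true}) {
            LeaderWatchSketch follower = new LeaderWatchSketch(handle);
            follower.leaderNodePresent = false; // leader node vanished during the outage
            follower.process(EventType.None, KeeperState.Expired);
            System.out.println("handleExpired=" + handle
                    + " staleLeader=" + follower.hasStaleLeader()
                    + " electionTriggered=" + follower.electionTriggered);
        }
    }
}
```

In this toy model the follower that ignores EventType.None leaves isLeader set 
on host1 indefinitely, which matches the stuck state described above, while 
the variant that reacts to "Expired" clears it. A real fix would still need to 
decide how to learn that the leader's node is gone after the session is 
re-established.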



