[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959364#comment-13959364
 ] 

Jessica Cheng commented on SOLR-5952:
-------------------------------------

Hi Daniel,

{quote}
I know there have been issues where, if the follower disconnected from ZK, it 
would fail to take updates from the leader (since it can't confirm the source of 
the messages is the real leader), so the follower would get asked to recover, 
and would have to wait until it had a valid ZK connection in order to do that. 
But I believe there have been fixes around that area.
{quote}
What you describe doesn't seem to be related to this case. Here, when the 
follower finally reconnected to zk, there was no leader at all, and it failed to 
register itself because it hit a NoNodeException on 
/collections/test-1/leaders/shard2 while looking up the leader. It never got to 
re-join the election or to recover.
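
For illustration, here is a minimal standalone sketch of reading that leader 
path with the plain ZooKeeper client; the class name, connect string and 
timeout are made up, and this is not Solr's registration code, but it shows 
where the NoNodeException surfaces when the ephemeral leader znode never comes 
back:

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical probe, not Solr code: read the shard's leader znode the way the
// registration path ultimately has to, and see what happens when it is absent.
public class LeaderPathProbe {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        // connection-state events only; nothing to do for this one-shot probe
      }
    });
    String leaderPath = "/collections/test-1/leaders/shard2";
    try {
      byte[] data = zk.getData(leaderPath, false, null);
      System.out.println("leader props: " + new String(data, "UTF-8"));
    } catch (KeeperException.NoNodeException e) {
      // This is the state the follower was stuck in: the ephemeral leader znode
      // is gone and nothing re-created it, so there is no leader to recover from.
      System.out.println("no leader registered at " + leaderPath);
    } finally {
      zk.close();
    }
  }
}
{code}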

{quote}
In the example logs here though (I'm assuming host 1 was the leader) host1 says 
that its last published state was down? We might need to go further back in the 
trace history of that node, why did it publish itself as down but was still 
leader?
{quote}
Yes, this is what both Mark and I were confused about. However, I went back 
hours in the logs trying to find the core being marked as down and couldn't 
find it. (I grepped for "publishing core" from ZkController.publish.)

> Recovery race/ error
> --------------------
>
>                 Key: SOLR-5952
>                 URL: https://issues.apache.org/jira/browse/SOLR-5952
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7
>            Reporter: Jessica Cheng
>            Assignee: Mark Miller
>              Labels: leader, recovery, solrcloud, zookeeper
>             Fix For: 4.8, 5.0
>
>         Attachments: recovery-failure-host1-log.txt, 
> recovery-failure-host2-log.txt
>
>
> We're seeing some shard recovery errors in our cluster after a zookeeper 
> "error event". In this particular case, we had two replicas. The sequence of 
> events from the logs looks roughly like this:
> 18:40:36 follower (host2) disconnected from zk
> 18:40:38 original leader started recovery (there was no log about why it 
> needed recovery though) and failed because the cluster state still said it was 
> the leader
> 18:40:39 follower successfully connected to zk after some trouble
> 19:03:35 follower registered its core/replica
> 19:16:36 follower registration failed due to no leader (NoNode for 
> /collections/test-1/leaders/shard2)
> Essentially, I think the problem is that the isLeader property on the cluster 
> state is never cleaned up, so neither replica is able to recover/register in 
> order to participate in leader election again.
> From the code, it looks like the only place the isLeader property is cleared 
> from the cluster state is ElectionContext.runLeaderProcess, which assumes that 
> the replica with the next election seqId will notice the leader's node 
> disappearing and run the leader process. That assumption fails in this 
> scenario because the follower experienced the same zookeeper "error event" and 
> never handled the event of the leader going away. (Mark, this is what I was 
> saying in SOLR-3582: maybe the watcher in LeaderElector.checkIfIamLeader does 
> need to handle "Expired" by somehow realizing that the leader is gone and 
> clearing the isLeader state at least, but it currently ignores all 
> EventType.None events.)


