[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025553#comment-16025553
 ] 

Abraham Fine commented on ZOOKEEPER-2791:
-----------------------------------------

Hi [~mheffner]-

Thanks for reporting this issue and uploading logs.

I have been trying to reproduce the issue with both 3.3.6 and 3.4.8 and have 
been unsuccessful. To trigger the rollover, I changed `if 
((request.zxid & 0xffffffffL) == 0xffffffffL) {` to `if ((request.zxid & 
0xffffffffL) == SOME_SMALLER_VALUE) {` so that a leader election is forced much 
sooner, and in my testing, ZooKeeper has handled it properly.
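For anyone following along, here is a small standalone sketch (not the actual ZooKeeper source; class and method names are my own) of what that check is doing. A zxid packs the leader epoch into the high 32 bits and a per-epoch counter into the low 32 bits, so when the low 32 bits saturate the leader shuts itself down to force an election and a new epoch:

```java
// Illustrative only -- class and helper names are hypothetical,
// but the bit layout and the masked comparison mirror the check
// quoted above from the ZooKeeper source.
public class ZxidRollover {
    // A zxid is a 64-bit value: epoch in the high 32 bits,
    // per-epoch transaction counter in the low 32 bits.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    // True once the counter portion has reached its maximum,
    // i.e. the next transaction would overflow into the epoch bits.
    static boolean lowBitsSaturated(long zxid) {
        return (zxid & 0xffffffffL) == 0xffffffffL;
    }

    public static void main(String[] args) {
        long nearRollover = makeZxid(3, 0xfffffffeL);
        System.out.println(lowBitsSaturated(nearRollover));     // false
        System.out.println(lowBitsSaturated(nearRollover + 1)); // true
    }
}
```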

I was wondering if you have additional logs showing what was happening while 
the cluster was down. As far as I can tell, the uploaded logs cover only about 
a second and come from only 2 of the machines. Would it be possible to get logs 
for the first few minutes after the rollover from all of the machines in the 
cluster? It would be great to see all of the leader election messages being 
exchanged.

Thanks,
Abe

> Quorum doesn't recover after zxid rollover
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2791
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2791
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.3.6, 3.4.8
>         Environment: Ubuntu 14.04.4 LTS, AWS EC2, 5 node ensembles
>            Reporter: Mike Heffner
>            Assignee: Abraham Fine
>
> When the zxid rolls over, the ensemble is unable to recover without manually 
> restarting the cluster. The leader enters shutdown() when the zxid rolls 
> over, but the remaining four nodes in the ensemble are not able to elect a 
> new leader. This state persisted for at least 15 minutes before an 
> operator manually restarted the cluster and the ensemble recovered.
> Config:
> --------
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/raid0/zookeeper
> clientPort=2181
> maxClientCnxns=100
> autopurge.snapRetainCount=14
> autopurge.purgeInterval=24
> leaderServes: True
> server.7=172.26.134.88:2888:3888
> server.6=172.26.136.143:2888:3888
> server.5=172.26.135.103:2888:3888
> server.4=172.26.134.16:2888:3888
> server.9=172.26.135.19:2888:3888
> Logs:
> https://gist.github.com/mheffner/d615d358d4a360ae56a0d0a280040640



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
