[
https://issues.apache.org/jira/browse/ZOOKEEPER-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025553#comment-16025553
]
Abraham Fine commented on ZOOKEEPER-2791:
-----------------------------------------
Hi [~mheffner]-
Thanks for reporting this issue and uploading logs.
I have been trying to reproduce the issue with both 3.3.6 and 3.4.8 and have
been unsuccessful. To trigger the rollover early, I changed `if
((request.zxid & 0xffffffffL) == 0xffffffffL) {` to `if ((request.zxid &
0xffffffffL) == SOME_SMALLER_VALUE) {` to force a leader election, and in my
testing ZooKeeper has handled it properly.
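For reference, here is a minimal sketch of the check being modified. It assumes
the usual zxid layout (epoch in the high 32 bits, per-epoch counter in the low
32 bits). The class and method names are illustrative only and are not taken
from the ZooKeeper source; the `threshold` parameter stands in for
SOME_SMALLER_VALUE above.

```java
public class ZxidRolloverSketch {
    // Low 32 bits of a zxid hold the per-epoch counter.
    static final long COUNTER_MASK = 0xffffffffL;

    // High 32 bits of a zxid hold the leader epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // Hypothetical helper for building test values.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // Mirrors the shape of the condition quoted above. Passing COUNTER_MASK
    // reproduces the original check; a smaller threshold (an assumption used
    // only for testing) makes the condition fire long before a real rollover.
    static boolean shouldForceElection(long zxid, long threshold) {
        return counterOf(zxid) == threshold;
    }

    public static void main(String[] args) {
        long atLimit = makeZxid(5, 0xffffffffL);
        long early = makeZxid(5, 1000);

        System.out.println(shouldForceElection(atLimit, COUNTER_MASK)); // true
        System.out.println(shouldForceElection(early, COUNTER_MASK));   // false
        System.out.println(shouldForceElection(early, 1000));           // true
    }
}
```

With the lowered threshold the same code path runs after only a small number
of transactions, which is what makes the rollover practical to exercise in a
test cluster.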
I was wondering if you had additional logs that show what was happening while
the cluster was down. As far as I can tell, the uploaded logs cover only one
second and come from only 2 machines. Would it be possible to provide logs for
the first few minutes after the rollover from all the machines in the cluster?
It would be great to see all of the leader election messages being exchanged.
Thanks,
Abe
> Quorum doesn't recover after zxid rollover
> ------------------------------------------
>
> Key: ZOOKEEPER-2791
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2791
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum
> Affects Versions: 3.3.6, 3.4.8
> Environment: Ubuntu 14.04.4 LTS, AWS EC2, 5 node ensembles
> Reporter: Mike Heffner
> Assignee: Abraham Fine
>
> When zxid rolls over the ensemble is unable to recover without manually
> restarting the cluster. The leader enters shutdown() state when zxid rolls
> over, but the remaining four nodes in the ensemble are not able to re-elect a
> new leader. This state has persisted for at least 15 minutes before an
> operator manually restarted the cluster and the ensemble recovered.
> Config:
> --------
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/raid0/zookeeper
> clientPort=2181
> maxClientCnxns=100
> autopurge.snapRetainCount=14
> autopurge.purgeInterval=24
> leaderServes: True
> server.7=172.26.134.88:2888:3888
> server.6=172.26.136.143:2888:3888
> server.5=172.26.135.103:2888:3888
> server.4=172.26.134.16:2888:3888
> server.9=172.26.135.19:2888:3888
> Logs:
> https://gist.github.com/mheffner/d615d358d4a360ae56a0d0a280040640