[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1609#comment-1609 ]

Benedict Jin commented on ZOOKEEPER-1277:

I created a new JIRA, ZOOKEEPER-2789, to discuss reassigning the `ZXID` to solve the 32-bit overflow problem. Could you please offer some advice on it?

> servers stop serving when lower 32bits of zxid roll over
>
> Key: ZOOKEEPER-1277
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.3
> Reporter: Patrick Hunt
> Assignee: Patrick Hunt
> Priority: Critical
> Fix For: 3.3.5, 3.4.4, 3.5.0
> Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch,
> ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch,
> ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch,
> ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch
>
> When the lower 32bits of a zxid "roll over" (zxid is a 64 bit number, however
> the upper 32 are considered the epoch number) the epoch number (upper 32
> bits) are incremented and the lower 32 start at 0 again.
> This should work fine, however in the current 3.3 branch the followers see
> this as a NEWLEADER message, which it's not, and effectively stop serving
> clients. Attached clients seem to eventually time out given that heartbeats
> (or any operation) are no longer processed. The follower doesn't recover from
> this.
> I've tested this out on 3.3 branch and confirmed this problem, however I
> haven't tried it on 3.4/3.5. It may not happen on the newer branches due to
> ZOOKEEPER-335, however there is certainly an issue with updating the
> "acceptedEpoch" files contained in the datadir. (I'll enter a separate jira
> for that)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
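Editor's illustration: the issue description says the upper 32 bits of a zxid are the epoch and the lower 32 bits are a counter. The sketch below shows why a plain 64-bit increment of an exhausted counter silently carries into the epoch field. The class and helper names (`ZxidCarryDemo`, `epoch`, `counter`, `makeZxid`) are hypothetical and are not taken from the ZooKeeper source; the epoch value 0x1a is only an example.

```java
// Hypothetical helper class for illustration; not part of ZooKeeper.
final class ZxidCarryDemo {
    // Upper 32 bits: epoch number.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Lower 32 bits: per-epoch transaction counter.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Pack an epoch and a counter into one 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long last = makeZxid(0x1aL, 0xffffffffL); // counter is exhausted
        long next = last + 1;                     // ordinary 64-bit increment
        System.out.printf("last: epoch=0x%x counter=0x%x%n", epoch(last), counter(last));
        System.out.printf("next: epoch=0x%x counter=0x%x%n", epoch(next), counter(next));
    }
}
```

Running `main` shows the counter going from 0xffffffff to 0x0 while the epoch moves from 0x1a to 0x1b, even though no leader election took place. That is the epoch change the followers on the 3.3 branch are described as misreading.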
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1607#comment-1607 ]

Benedict Jin commented on ZOOKEEPER-1277:

I see. Thank you! :D
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1601#comment-1601 ]

Patrick Hunt commented on ZOOKEEPER-1277:

[~benedict jin] Not sure I follow that question. I believe it should be OK to add a new server during a re-election, even if that election was triggered by an epoch overflow. I've never tried that, however.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013819#comment-16013819 ]

Benedict Jin commented on ZOOKEEPER-1277:

@Patrick Hunt Hi, Patrick Hunt. Suppose ZK hits the `32bits` overflow and forces a leader re-election, and at the same time my `keep alive` shell script runs the command `zkServer.sh start` from outside. Could that cause a problem?
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044976#comment-15044976 ]

Flavio Junqueira commented on ZOOKEEPER-1277:

[~frenzzz] The log messages you posted say that it is triggering a new election, which starts a new epoch and consequently resets the zxid. What's the problem you're observing more precisely?
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045368#comment-15045368 ]

Alexandr Orlov commented on ZOOKEEPER-1277:

I mean that when the zxid "roll over" occurs and a new leader election is triggered, ZooKeeper stops serving. Leader activation in our environment takes about 30 seconds, and the zxid rollover happens about twice a week, which is not good. It would be great, if possible, to find a solution that avoids the re-election.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044853#comment-15044853 ]

Alexandr Orlov commented on ZOOKEEPER-1277:

Hi! We still have a problem with versions 3.4.6 and 3.5.1-alpha:

{noformat}
2015-12-03 11:02:48,073 - ERROR [ProcessThread(sid:5 cport:-1)::org.apache.zookeeper.server.PrepRequestProcessor@139] - Unexpected exception
org.apache.zookeeper.server.RequestProcessor$RequestProcessorException: zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start
    at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:80)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:673)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
Caused by: org.apache.zookeeper.server.quorum.Leader$XidRolloverException: zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start
    at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:746)
    at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
    ... 2 more
2015-12-03 11:02:48,073 - WARN [LearnerHandler-/2a02:6b8:0:1602:37a6:f71a:79c1:e5f3:48040:org.apache.zookeeper.server.quorum.LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
    at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
2015-12-03 11:02:49,766 - INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.Leader@493] - Shutting down
2015-12-03 11:02:49,766 - INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.QuorumPeer@714] - LOOKING
2015-12-03 11:02:49,766 - DEBUG [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.QuorumPeer@645] - Initializing leader election protocol...
{noformat}

According to https://zookeeper.apache.org/doc/r3.5.1-alpha/releasenotes.html this problem should be resolved, but it isn't.
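Editor's illustration: the stack trace above shows a `Leader$XidRolloverException` raised from `Leader.propose`, with a message saying the lower 32 bits have rolled over and a re-election is being forced. A guard of that general shape could look like the sketch below. This is an assumption-laden illustration and not the ZooKeeper source: the class `RolloverGuardSketch`, the methods `counterExhausted` and `checkBeforePropose`, and the nested exception type are all hypothetical stand-ins, and the exact condition the real code tests is assumed here to be "counter bits all set".

```java
// Hypothetical sketch of a rollover guard; names are not from ZooKeeper.
final class RolloverGuardSketch {
    // Stand-in for the exception type named in the stack trace.
    static final class XidRolloverException extends Exception {
        XidRolloverException(String message) {
            super(message);
        }
    }

    static final long COUNTER_MASK = 0xffffffffL;

    // True when the lower 32 bits cannot be incremented without
    // carrying into the epoch (upper 32 bits).
    static boolean counterExhausted(long zxid) {
        return (zxid & COUNTER_MASK) == COUNTER_MASK;
    }

    // Refuse to propose with an exhausted counter; the caller is
    // expected to give up leadership so a new epoch is negotiated.
    static void checkBeforePropose(long zxid) throws XidRolloverException {
        if (counterExhausted(zxid)) {
            throw new XidRolloverException(
                "zxid lower 32 bits have rolled over, forcing re-election, "
                + "and therefore new epoch start");
        }
    }
}
```

Under this reading, the log lines that follow the exception (`Shutting down`, `LOOKING`, `Initializing leader election protocol...`) are the intended consequence of the guard rather than a crash, which matches the observation that the ensemble briefly stops serving during each rollover.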
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915635#comment-13915635 ]

Lu Xuehui commented on ZOOKEEPER-1277:

When the zxid rolls over, do epoch++; when a new leader arises, do epoch += 2. Could this approach avoid throwing the exception?
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633420#comment-13633420 ]

Dave Latham commented on ZOOKEEPER-1277:

We recently experienced an HBase outage that I believe was caused by this issue. Running on ZK 3.4.4, the log for the leader shows this:

{noformat}
2013-04-12 17:46:25,894 INFO org.apache.zookeeper.server.quorum.Leader: Have quorum of supporters; starting up and setting last processed zxid: 0x1a0004
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.FinalRequestProcessor: Zxid outstanding 111669149696 is less than current 111669149697
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.quorum.LearnerHandler: *** GOODBYE /10.0.1.100:34796
2013-04-12 17:46:25,896 ERROR org.apache.zookeeper.server.NIOServerCnxnFactory: Thread LearnerHandler Socket[addr=/10.0.1.100,port=34796,localport=2888] tickOfLastAck:897811 synced?:true queuedPacketLength:0 died
java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:638)
    at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.startSessionTracker(LeaderZooKeeperServer.java:87)
    at org.apache.zookeeper.server.ZooKeeperServer.startup(ZooKeeperServer.java:394)
    at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:531)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:497)
{noformat}

Immediately after this one of the followers had a new election and became a follower again. Also, the heap on the leader immediately climbed until the process became stuck spending most of its time in GC. At this point HBase region servers started dropping like flies and then the ZK node was killed.

I'm adding this comment now for two purposes. First, so that if other people see the same symptom in their logs they may find this issue faster. Second, I'd love to hear from anyone more familiar with ZooKeeper whether this issue does indeed explain the observations I described above.
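Editor's illustration: the `FinalRequestProcessor` warning above prints zxids in decimal, which hides the epoch/counter split. The short sketch below decodes the two values from that log line. The class name `ZxidLogDecoder` and the method `describe` are hypothetical; only the two decimal numbers come from the log.

```java
// Hypothetical decoder for decimal zxids copied from a log line.
final class ZxidLogDecoder {
    // Render a zxid as hex plus its epoch (upper 32 bits) and
    // counter (lower 32 bits).
    static String describe(long zxid) {
        return String.format("0x%x (epoch=0x%x, counter=%d)",
                zxid, zxid >>> 32, zxid & 0xffffffffL);
    }

    public static void main(String[] args) {
        System.out.println("outstanding: " + describe(111669149696L));
        System.out.println("current:     " + describe(111669149697L));
    }
}
```

Both values decode to epoch 0x1a, with counters 0 and 1. In other words, the leader was at the very start of an epoch when the warning was logged. These two numbers alone do not show whether that epoch began because of a counter rollover or for some other reason.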
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633423#comment-13633423 ]

Dave Latham commented on ZOOKEEPER-1277:

Excuse me, we were running 3.4.3, not 3.4.4.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633563#comment-13633563 ]

Patrick Hunt commented on ZOOKEEPER-1277:

Hi [~davelatham], it seems unlikely to me. Are you only running HBase against ZK? Because in that case the number of changes to ZK is going to be less than 4 billion (the amount necessary to roll over the lower 32 bits); HBase just doesn't generate that much traffic. I've only seen the rollover case with 10k's of clients doing large numbers of operations per second. HBase just doesn't drive that much traffic; it's mainly for failover and table management.

You might have hit an issue with 3.4 that was fixed in a subsequent release. However, the symptoms you mentioned don't ring a bell either.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633565#comment-13633565 ]

Dave Latham commented on ZOOKEEPER-1277:

Thanks for the response, [~phunt]. It is only HBase, but there are 1000 region servers and we are using replication, which puts a much greater load on ZK. Taking a recent sample, I see the zxid going up by thousands per second.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633580#comment-13633580 ]

Patrick Hunt commented on ZOOKEEPER-1277:

[~davelatham] this could be it then. 1k's/sec means ~ a month before rollover.
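Editor's illustration: the "~ a month" estimate follows from dividing the 2^32 values of the 32-bit counter by the write rate. The sketch below works the arithmetic in both directions, assuming a constant rate and one zxid per write. The class and method names are hypothetical, and the sample rates are round numbers chosen for illustration rather than measurements from this thread.

```java
// Hypothetical back-of-the-envelope calculator; assumes a constant
// write rate and one zxid consumed per write.
final class RolloverEstimate {
    static final double COUNTER_SPACE = 4294967296.0; // 2^32 counter values
    static final double SECONDS_PER_DAY = 86400.0;

    // Days until the 32-bit counter is exhausted at a given rate.
    static double daysUntilRollover(double txnPerSecond) {
        return COUNTER_SPACE / txnPerSecond / SECONDS_PER_DAY;
    }

    // Sustained rate implied by an observed interval between rollovers.
    static double txnPerSecondFor(double daysBetweenRollovers) {
        return COUNTER_SPACE / (daysBetweenRollovers * SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.printf("1000 txn/s -> %.1f days%n", daysUntilRollover(1000));
        System.out.printf("2000 txn/s -> %.1f days%n", daysUntilRollover(2000));
        System.out.printf("every 3.5 days -> %.0f txn/s%n", txnPerSecondFor(3.5));
    }
}
```

At exactly 1,000 writes per second the counter lasts about 49.7 days, and at 2,000 per second about 24.9 days, so "thousands per second" lands in the range of a few weeks to a month. By the same arithmetic, a rollover roughly twice a week, as reported elsewhere in this thread, would correspond to a sustained rate of about 14,000 writes per second.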
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231062#comment-13231062 ]

Hudson commented on ZOOKEEPER-1277:

Integrated in ZooKeeper-trunk #1493 (See [https://builds.apache.org/job/ZooKeeper-trunk/1493/])
ZOOKEEPER-1277. servers stop serving when lower 32bits of zxid roll over (phunt) (Revision 1301079)

Result = SUCCESS
phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1301079
Files :
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/RequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/SyncRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ProposalRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyRequestProcessor.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/ZxidRolloverTest.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229937#comment-13229937 ]

Hadoop QA commented on ZOOKEEPER-1277:
--------------------------------------

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12518426/ZOOKEEPER-1277_trunk.patch
  against trunk revision 1297740.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 5 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229941#comment-13229941 ]

Mahadev konar commented on ZOOKEEPER-1277:
------------------------------------------

+1 on the patches. Looked through all 3. Good to go! Thanks Pat!
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229656#comment-13229656 ]

Flavio Junqueira commented on ZOOKEEPER-1277:
---------------------------------------------

Ok, I only looked at propose() as you suggested, Pat. That method sounds right: it forces a leader election when we reach the limit. However, I'm not sure how we guarantee that Zab will work correctly under this exception. It is an invariant of the protocol that a follower won't go back to a previous epoch; if we roll over, then followers will have to go back to a previous epoch, no? How do we make sure that it doesn't break the protocol implementation?
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229672#comment-13229672 ]

Patrick Hunt commented on ZOOKEEPER-1277:
-----------------------------------------

That's correct. Based on the feedback I got on the previous attempt, it was clear that we cannot continue without a re-election. In this case I detect the rollover just before it occurs and drop leadership at that point. The re-election then happens, a new epoch is chosen, and the lower 32 bits are thereby reset.
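The check described in this comment can be sketched in Java roughly as follows. This is a minimal illustration and not the code from the attached patches: the class and method names (`ZxidRolloverSketch`, `rolloverImminent`, `nextZxid`) are hypothetical, and an `IllegalStateException` stands in for whatever mechanism the leader really uses to give up leadership. It relies only on the layout stated in the issue description, where the upper 32 bits of a zxid are the epoch and the lower 32 bits are a counter.

```java
/**
 * Sketch only: zxid layout plus a "rollover is about to happen" check.
 * All names here are hypothetical and not taken from the ZooKeeper source.
 */
final class ZxidRolloverSketch {
    // Mask selecting the lower 32 bits (the per-epoch counter).
    private static final long COUNTER_MASK = 0xffffffffL;

    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    /** True when the counter is at its maximum, so one more increment would spill into the epoch bits. */
    static boolean rolloverImminent(long zxid) {
        return counterOf(zxid) == COUNTER_MASK;
    }

    /**
     * Returns the next zxid within the same epoch, or throws to signal that
     * the leader should step down so that re-election picks a new epoch.
     * (Assumption: the exception is a stand-in for dropping leadership.)
     */
    static long nextZxid(long lastZxid) {
        if (rolloverImminent(lastZxid)) {
            throw new IllegalStateException(
                "lower 32 bits of zxid exhausted in epoch " + epochOf(lastZxid)
                + "; leader must step down and force re-election");
        }
        return lastZxid + 1;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 0xfffffffeL);
        zxid = nextZxid(zxid); // still epoch 3, counter now at its maximum
        System.out.println("epoch=" + epochOf(zxid) + " imminent=" + rolloverImminent(zxid));
        try {
            nextZxid(zxid);
        } catch (IllegalStateException e) {
            System.out.println("stepping down: " + e.getMessage());
        }
    }
}
```

The point of the design is that the epoch only ever changes through leader election, so the counter is never allowed to carry into the epoch bits on its own.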
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229704#comment-13229704 ]

Patrick Hunt commented on ZOOKEEPER-1277:
-----------------------------------------

haha, yea I hear you. This is not for unit testing though. I use setZxid for that (see the original test). The system property is there to allow QA to test this on a real cluster. I've used it for the first level of verification: I started a 3 node cluster with this system property set and used a standard client to force the re-election (by creating znodes, for example). I can then see that the real servers operate properly and handle this case, without waiting for a month of writes to go through the system. Does that make more sense? (I'll update the comment)
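To put the "month of writes" remark in perspective, here is a small back-of-the-envelope calculation in Java. It is only a sketch: the class name and the write rates are made up for illustration, and it assumes every write consumes exactly one value of the 32-bit counter.

```java
/**
 * Sketch only: how long a leader can stay in one epoch before the
 * lower 32 bits of the zxid are exhausted. Names and rates are hypothetical.
 */
final class RolloverEtaSketch {
    // Number of distinct counter values available within a single epoch.
    static final double COUNTER_SPACE = 4294967296.0; // 2^32
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Days until the counter is exhausted at a sustained write rate. */
    static double daysUntilRollover(double writesPerSecond) {
        return COUNTER_SPACE / writesPerSecond / SECONDS_PER_DAY;
    }

    /** Sustained write rate needed to exhaust the counter in the given number of days. */
    static double writesPerSecondFor(double days) {
        return COUNTER_SPACE / (days * SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.printf("1000 writes/s  -> %.1f days%n", daysUntilRollover(1000));
        System.out.printf("30 day rollover -> %.0f writes/s%n", writesPerSecondFor(30));
    }
}
```

At a sustained 1,000 writes per second the counter lasts roughly 49.7 days, which is why a way to start a test cluster close to the limit is useful for verification.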
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229708#comment-13229708 ]

Mahadev konar commented on ZOOKEEPER-1277:
------------------------------------------

Ahh... That makes more sense! Updated comments would be good. Thanks!
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229906#comment-13229906 ]

Hadoop QA commented on ZOOKEEPER-1277:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12518421/ZOOKEEPER-1277_trunk.patch
  against trunk revision 1297740.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 5 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149943#comment-13149943 ]

Patrick Hunt commented on ZOOKEEPER-1277:
-----------------------------------------

I thought about that, but it seemed like a bad idea for 2 reasons I could think of:

1) it would cause all of the clients to disconnect and reconnect unnecessarily, perhaps introducing instability in the process.
2) can we guarantee that the leader will give up leadership? i.e., how do we effect this: exit the JVM on the leader?

In talking with Ben about it in the past (perhaps he's since changed his mind) he seemed to think that rolling over to a new epoch number (with no leader re-election) was OK.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149973#comment-13149973 ]

Flavio Junqueira commented on ZOOKEEPER-1277:
---------------------------------------------

The scenario I have in mind to say this is incorrect is more or less the following:

# Leader L is currently in epoch 3 and it moves to epoch 4 in the way this patch proposes by simply adding 2 to hzxid. The leader proposes a transaction with zxid 4,1, which is acknowledged by some follower F, but not a quorum;
# Concurrently, a new leader L' arises and selects 4 as its epoch (it hasn't talked to L or F);
# L' proposes a transaction with zxid 4,1, which is different from the transaction L proposed with the same zxid and this transaction is acknowledged by a quorum;
# L eventually gives up on leadership after noticing that it is not supported by a quorum;
# L' crashes;
# A new leader arises and its highest zxid is 4,1. It doesn't have to synchronize with any of the followers because they all have highest zxid 4,1.

We have servers that have different transaction values for the same zxid, which constitutes an inconsistent state.
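The scenario in this comment can be replayed with a toy model. The Java sketch below is not ZooKeeper code and does not implement Zab: the `Server` class, the five-server ensemble (L, F, L', S1, S2) and the transaction labels are assumptions made for illustration. It only shows the end state of the steps listed above: every server reports the same highest zxid, yet the logs disagree on what that zxid contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Sketch only: toy replay of the duplicate-zxid scenario. All names are hypothetical. */
final class DuplicateZxidSketch {
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** A server reduced to its transaction log, keyed by zxid. */
    static final class Server {
        final String name;
        final TreeMap<Long, String> log = new TreeMap<>();

        Server(String name) {
            this.name = name;
        }

        void accept(long zxid, String txn) {
            log.put(zxid, txn);
        }

        long lastZxid() {
            return log.isEmpty() ? 0L : log.lastKey();
        }
    }

    /** Replays the scenario on an assumed 5-server ensemble and returns the servers. */
    static List<Server> runScenario() {
        List<Server> servers = new ArrayList<>();
        for (String name : new String[] {"L", "F", "L'", "S1", "S2"}) {
            servers.add(new Server(name));
        }
        // Everyone agrees on the history up to the end of epoch 3.
        for (Server s : servers) {
            s.accept(zxid(3, 0xffffffffL), "last txn of epoch 3");
        }
        long contested = zxid(4, 1);
        // L rolls itself into epoch 4; only L and F log its proposal (no quorum).
        servers.get(0).accept(contested, "txn-A");
        servers.get(1).accept(contested, "txn-A");
        // L' is elected with epoch 4 and gets a quorum (L', S1, S2) for a different txn.
        for (int i = 2; i < servers.size(); i++) {
            servers.get(i).accept(contested, "txn-B");
        }
        return servers;
    }

    /** What a new leader would compare: do all servers report the same highest zxid? */
    static boolean allShareLastZxid(List<Server> servers) {
        long first = servers.get(0).lastZxid();
        return servers.stream().allMatch(s -> s.lastZxid() == first);
    }

    /** Ground truth: do all logs actually hold the same transactions? */
    static boolean logsIdentical(List<Server> servers) {
        TreeMap<Long, String> first = servers.get(0).log;
        return servers.stream().allMatch(s -> s.log.equals(first));
    }

    public static void main(String[] args) {
        List<Server> servers = runScenario();
        System.out.println("same highest zxid: " + allShareLastZxid(servers));
        System.out.println("same log contents: " + logsIdentical(servers));
    }
}
```

Because synchronization decisions are based on the highest zxid, a new leader in this end state sees nothing to reconcile, which is the inconsistency the comment describes.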
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149982#comment-13149982 ]

Patrick Hunt commented on ZOOKEEPER-1277:
-----------------------------------------

I see. Yes, that would be bad. I'll try reworking the patch to drop leadership. Any suggestions on where to look to make that happen?
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150047#comment-13150047 ]

Flavio Junqueira commented on ZOOKEEPER-1277:
---------------------------------------------

For a quorum setup, it sounds like a good place would be in ProposalRequestProcessor.proposeRequest(). For standalone, it sounds like we should be doing something along the lines of what you proposed in your patch.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150066#comment-13150066 ]

Patrick Hunt commented on ZOOKEEPER-1277:
-----------------------------------------

I'll rework the patch and get back. Thanks for the feedback Flavio.
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148924#comment-13148924 ]

Hadoop QA commented on ZOOKEEPER-1277:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12503459/ZOOKEEPER-1277_br33.patch
  against trunk revision 1201045.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 5 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//console

This message is automatically generated.