[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2017-05-23 Thread Benedict Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1609#comment-1609
 ] 

Benedict Jin commented on ZOOKEEPER-1277:
-

I created a new JIRA, ZOOKEEPER-2789, to discuss reassigning the `ZXID` to 
solve the 32-bit overflow problem. Could you please offer some advice on it?

> servers stop serving when lower 32bits of zxid roll over
> 
>
> Key: ZOOKEEPER-1277
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.3
> Reporter: Patrick Hunt
> Assignee: Patrick Hunt
> Priority: Critical
> Fix For: 3.3.5, 3.4.4, 3.5.0
>
> Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
> ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
> ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, 
> ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch
>
>
> When the lower 32 bits of a zxid "roll over" (a zxid is a 64-bit number, but 
> the upper 32 bits are treated as the epoch number), the epoch number (upper 32 
> bits) is incremented and the lower 32 bits start at 0 again.
> This should work fine; however, in the current 3.3 branch the followers see 
> this as a NEWLEADER message, which it's not, and effectively stop serving 
> clients. Attached clients seem to eventually time out given that heartbeats 
> (or any operation) are no longer processed. The follower doesn't recover from 
> this.
> I've tested this on the 3.3 branch and confirmed the problem; however, I 
> haven't tried it on 3.4/3.5. It may not happen on the newer branches due to 
> ZOOKEEPER-335; however, there is certainly an issue with updating the 
> "acceptedEpoch" files contained in the datadir. (I'll enter a separate JIRA 
> for that.)
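
The zxid layout in the description above can be illustrated with a small, self-contained sketch. It assumes only what the description states (upper 32 bits are the epoch, lower 32 bits are a counter). The class and method names are hypothetical and are not taken from the ZooKeeper source.

```java
// Illustrative sketch only: names are hypothetical, not ZooKeeper's own API.
final class ZxidLayout {
    // Mask selecting the lower 32 bits (the per-epoch counter).
    static final long COUNTER_MASK = 0xffffffffL;

    // Upper 32 bits: the epoch number.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Lower 32 bits: the counter within the epoch.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // Combine an epoch and a counter into a single 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // True when the next increment would carry out of the counter into the epoch bits.
    static boolean counterExhausted(long zxid) {
        return counterOf(zxid) == COUNTER_MASK;
    }

    public static void main(String[] args) {
        long last = makeZxid(0x19, COUNTER_MASK); // last zxid of epoch 0x19
        long next = last + 1;                     // plain increment carries into the epoch
        System.out.println("last: 0x" + Long.toHexString(last)
                + " exhausted=" + counterExhausted(last));
        System.out.println("next: epoch=" + epochOf(next)
                + " counter=" + counterOf(next));
    }
}
```

A plain `+ 1` on an exhausted counter silently changes the epoch bits, which is the "roll over" the description refers to: the epoch appears to advance without any leader election having taken place.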



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2017-05-23 Thread Benedict Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1607#comment-1607
 ] 

Benedict Jin commented on ZOOKEEPER-1277:
-

I see. Thank you! :D



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2017-05-23 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1601#comment-1601
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

[~benedict jin] Not sure I follow that question. I believe it should be OK to 
add a new server during a re-election, even if that election was triggered by 
an epoch overflow. I've never tried that, however.



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2017-05-17 Thread Benedict Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013819#comment-16013819
 ] 

Benedict Jin commented on ZOOKEEPER-1277:
-

@Patrick Hunt Hi, Patrick. Suppose ZK hits the 32-bit overflow and forces a 
leader re-election, and at the same time my keep-alive shell script runs the 
command `zkServer.sh start` from outside. Could that cause a problem?



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2015-12-07 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044976#comment-15044976
 ] 

Flavio Junqueira commented on ZOOKEEPER-1277:
-

[~frenzzz] The log messages you posted say that it is triggering a new 
election, which starts a new epoch and consequently resets the zxid. What's the 
problem you're observing more precisely?



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2015-12-07 Thread Alexandr Orlov (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045368#comment-15045368
 ] 

Alexandr Orlov commented on ZOOKEEPER-1277:
---

I mean that when the zxid rollover occurred and a new leader election was 
triggered, ZooKeeper stopped serving. Leader activation in our environment 
took about 30 seconds, and zxid rollover happens about twice per week, which 
is not great. It would be great, if possible, to find a solution that avoids 
the re-election.



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2015-12-07 Thread Alexandr Orlov (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044853#comment-15044853
 ] 

Alexandr Orlov commented on ZOOKEEPER-1277:
---

Hi! We still have a problem with versions 3.4.6, 3.5.1-alpha:
{noformat}
2015-12-03 11:02:48,073 - ERROR [ProcessThread(sid:5 cport:-1)::org.apache.zookeeper.server.PrepRequestProcessor@139] - Unexpected exception
org.apache.zookeeper.server.RequestProcessor$RequestProcessorException: zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start
    at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:80)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:673)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
Caused by: org.apache.zookeeper.server.quorum.Leader$XidRolloverException: zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start
    at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:746)
    at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
    ... 2 more
2015-12-03 11:02:48,073 - WARN  [LearnerHandler-/2a02:6b8:0:1602:37a6:f71a:79c1:e5f3:48040:org.apache.zookeeper.server.quorum.LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
    at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
2015-12-03 11:02:49,766 - INFO  [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.Leader@493] - Shutting down
2015-12-03 11:02:49,766 - INFO  [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.QuorumPeer@714] - LOOKING
2015-12-03 11:02:49,766 - DEBUG [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2183:org.apache.zookeeper.server.quorum.QuorumPeer@645] - Initializing leader election protocol...
{noformat}

According to https://zookeeper.apache.org/doc/r3.5.1-alpha/releasenotes.html 
that problem should be resolved, but it isn't.



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2014-02-28 Thread Lu Xuehui (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915635#comment-13915635
 ] 

Lu Xuehui commented on ZOOKEEPER-1277:
--

What if the epoch is incremented by one (epoch++) when the zxid rolls over, 
and by two (epoch += 2) when a new leader arises? Could this avoid throwing 
the exception?



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2013-04-16 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633420#comment-13633420
 ] 

Dave Latham commented on ZOOKEEPER-1277:


We recently experienced an HBase outage that I believe was caused by this 
issue.  Running on ZK 3.4.4, the log for the leader shows this:

{noformat}
2013-04-12 17:46:25,894 INFO org.apache.zookeeper.server.quorum.Leader: Have quorum of supporters; starting up and setting last processed zxid: 0x1a0004
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.FinalRequestProcessor: Zxid outstanding 111669149696 is less than current 111669149697
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.quorum.LearnerHandler: *** GOODBYE /10.0.1.100:34796 
2013-04-12 17:46:25,896 ERROR org.apache.zookeeper.server.NIOServerCnxnFactory: Thread LearnerHandler Socket[addr=/10.0.1.100,port=34796,localport=2888] tickOfLastAck:897811 synced?:true queuedPacketLength:0 died
java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:638)
    at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.startSessionTracker(LeaderZooKeeperServer.java:87)
    at org.apache.zookeeper.server.ZooKeeperServer.startup(ZooKeeperServer.java:394)
    at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:531)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:497)
{noformat}

Immediately after this one of the followers had a new election and became a 
follower again.  Also, the heap on the leader immediately climbed until the 
process became stuck spending most of its time in GC.  At this point HBase 
region servers started dropping like flies and then the ZK node was killed.

I'm adding this comment now for two purposes.  First, so that if other people 
see the same symptom in their logs they may find this issue faster.  Second, 
I'd love to hear from anyone more familiar with ZooKeeper whether this issue 
does indeed explain the observations described above.
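
For readers comparing their own logs, the two decimal zxids in the FinalRequestProcessor warning above can be split into epoch and counter. The sketch below is illustrative only: it assumes the upper-32/lower-32 layout given in the issue description, and the class name is hypothetical.

```java
// Illustrative sketch only: decodes decimal zxids as printed in the log excerpt.
final class ZxidLogDecoder {
    // Render a zxid as hex, plus its epoch (upper 32 bits) and counter (lower 32 bits).
    static String describe(long zxid) {
        long epoch = zxid >>> 32;
        long counter = zxid & 0xffffffffL;
        return String.format("zxid=0x%x epoch=0x%x counter=%d", zxid, epoch, counter);
    }

    public static void main(String[] args) {
        // Values copied from the "Zxid outstanding ... is less than current ..." line.
        System.out.println(describe(111669149696L)); // zxid=0x1a00000000 epoch=0x1a counter=0
        System.out.println(describe(111669149697L)); // zxid=0x1a00000001 epoch=0x1a counter=1
    }
}
```

Both values sit at the very start of epoch 0x1a. A counter of 0 also appears after an ordinary election, so this decoding is consistent with a rollover but does not prove one by itself.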


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2013-04-16 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633423#comment-13633423
 ] 

Dave Latham commented on ZOOKEEPER-1277:


Excuse me, we were running 3.4.3, not 3.4.4



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2013-04-16 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633563#comment-13633563
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

Hi [~davelatham], it seems unlikely to me. Are you only running HBase against 
ZK? Because in that case the number of changes to ZK is going to be far less 
than 4 billion (the amount necessary to roll over the lower 32 bits); HBase 
just doesn't generate that much traffic. I've only seen the rollover case with 
tens of thousands of clients doing large numbers of operations per second. 
HBase uses ZK mainly for failover and table management.

You might have hit an issue with 3.4 that was fixed in a subsequent release. 
However, the symptoms you mentioned don't ring a bell either.



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2013-04-16 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633565#comment-13633565
 ] 

Dave Latham commented on ZOOKEEPER-1277:


Thanks for the response, [~phunt].  It is only HBase, but there are 1000 region 
servers and we are using replication, which puts a much greater load on ZK.  
Taking a recent sample, I see the zxid going up by thousands per second.



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2013-04-16 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633580#comment-13633580
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

[~davelatham] This could be it, then. Thousands of zxids per second means 
roughly a month before rollover.
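
The "roughly a month" estimate can be checked with simple arithmetic: the lower 32 bits hold 2^32 (about 4.29 billion) values. The sketch below is a back-of-the-envelope calculation, not ZooKeeper code. It assumes a constant write rate and a counter that starts at 0 in each epoch, and the class name and sample rates are hypothetical.

```java
// Illustrative sketch only: estimates time until the 32-bit zxid counter is used up.
final class RolloverEstimate {
    // Number of distinct counter values in the lower 32 bits of a zxid.
    static final long COUNTER_SPACE = 1L << 32;
    static final double SECONDS_PER_DAY = 86_400.0;

    // Days until rollover at a constant rate of zxid increments per second.
    static double daysUntilRollover(double txnsPerSecond) {
        return COUNTER_SPACE / txnsPerSecond / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // Sample rates chosen for illustration; they are not measurements from the thread.
        for (double rate : new double[] {1_000, 2_000, 5_000}) {
            System.out.printf("%.0f txn/s -> %.1f days%n", rate, daysUntilRollover(rate));
        }
    }
}
```

At 1,000 increments per second the counter lasts about 49.7 days, and at 2,000 per second about 24.9 days, which brackets the "a month" figure. Run in reverse, the twice-per-week rollover reported elsewhere in this thread would correspond to roughly 14,000 increments per second.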



[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-16 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231062#comment-13231062
 ] 

Hudson commented on ZOOKEEPER-1277:
---

Integrated in ZooKeeper-trunk #1493 (See [https://builds.apache.org/job/ZooKeeper-trunk/1493/])
ZOOKEEPER-1277. servers stop serving when lower 32bits of zxid roll over (phunt) (Revision 1301079)

 Result = SUCCESS
phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1301079
Files : 
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/RequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/SyncRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ProposalRequestProcessor.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyRequestProcessor.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/ZxidRolloverTest.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java


 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.5, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, 
 ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch


 When the lower 32bits of a zxid roll over (zxid is a 64 bit number, however 
 the upper 32 are considered the epoch number) the epoch number (upper 32 
 bits) are incremented and the lower 32 start at 0 again.
 This should work fine, however in the current 3.3 branch the followers see 
 this as a NEWLEADER message, which it's not, and effectively stop serving 
 clients. Attached clients seem to eventually time out given that heartbeats 
 (or any operation) are no longer processed. The follower doesn't recover from 
 this.
 I've tested this out on 3.3 branch and confirmed this problem, however I 
 haven't tried it on 3.4/3.5. It may not happen on the newer branches due to 
 ZOOKEEPER-335, however there is certainly an issue with updating the 
 acceptedEpoch files contained in the datadir. (I'll enter a separate jira 
 for that)
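
The zxid layout described above can be sketched as follows. This is only an illustration, assuming the 64-bit split stated in the description (upper 32 bits epoch, lower 32 bits counter); the class and method names are hypothetical and are not taken from the ZooKeeper source.

```java
// Illustrative sketch only: ZxidLayoutSketch and its methods are hypothetical names.
final class ZxidLayoutSketch {
    /** Upper 32 bits of the zxid: the epoch. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Lower 32 bits of the zxid: the per-epoch counter. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Packs an epoch and a counter into one 64-bit zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long last = makeZxid(3, 0xffffffffL); // counter is at its maximum value
        long next = last + 1;                 // a plain increment carries into the epoch bits
        System.out.println(epochOf(last) + "," + counterOf(last));
        System.out.println(epochOf(next) + "," + counterOf(next));
    }
}
```

A plain increment of the last zxid in an epoch therefore changes the epoch bits without any leader election having taken place, which is the rollover this issue is about.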

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-15 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229937#comment-13229937
 ] 

Hadoop QA commented on ZOOKEEPER-1277:
--

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12518426/ZOOKEEPER-1277_trunk.patch
  against trunk revision 1297740.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//console

This message is automatically generated.

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.5, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, 
 ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-15 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229941#comment-13229941
 ] 

Mahadev konar commented on ZOOKEEPER-1277:
--

+1 on the patches. Looked through all 3. Good to go! Thanks Pat!

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.5, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, 
 ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Flavio Junqueira (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229656#comment-13229656
 ] 

Flavio Junqueira commented on ZOOKEEPER-1277:
-

Ok, I only looked at propose() as you suggested, Pat. That method sounds right: 
it forces a leader election when we reach the limit. However, I'm not sure how 
we guarantee that Zab will work correctly under this exception. It is an 
invariant of the protocol that a follower won't go back to a previous epoch; if 
we roll over, then followers will have to go back to a previous epoch, no? How 
do we make sure that it doesn't break the protocol implementation? 

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.6

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Patrick Hunt (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229672#comment-13229672
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

That's correct. Based on the feedback I got on the previous attempt, it was 
clear that we cannot continue without a re-election. In this patch I watch 
for the rollover that is just about to occur and drop leadership at that 
point. The re-election then happens, a new epoch is chosen, and the lower 
32 bits are thereby reset.
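
The approach described in this comment could be sketched roughly as below. This is a simplified illustration under stated assumptions, not the patch itself: RolloverGuardSketch, RolloverException and the method names are hypothetical, and the real leader does far more work when it gives up leadership.

```java
// Illustrative sketch only: all names here are hypothetical, not ZooKeeper's API.
final class RolloverGuardSketch {
    /** Hypothetical signal that the leader is giving up leadership. */
    static final class RolloverException extends RuntimeException {
        RolloverException(String msg) {
            super(msg);
        }
    }

    /** True when the lower 32 bits are at their maximum, so the next increment would roll over. */
    static boolean counterExhausted(long zxid) {
        return (zxid & 0xffffffffL) == 0xffffffffL;
    }

    /** Stand-in for a leader's propose step: refuse to propose once the counter is exhausted. */
    static long propose(long zxid) {
        if (counterExhausted(zxid)) {
            throw new RolloverException("lower 32 bits exhausted, dropping leadership to force re-election");
        }
        return zxid; // a real leader would broadcast the proposal here
    }

    /** First zxid of the following epoch: epoch + 1 with the counter reset to 0. */
    static long firstZxidOfNextEpoch(long zxid) {
        return ((zxid >>> 32) + 1) << 32;
    }
}
```

The point of the design is that the counter reset only ever happens as a side effect of a normal election, so the new epoch is agreed on by a quorum rather than picked unilaterally by the current leader.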


 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.6

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Patrick Hunt (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229704#comment-13229704
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

haha, yea I hear you. This is not for unit testing, though; I use setZxid 
for that (see the original test). 

The system property is to allow QA to test this on a real cluster. I've used 
it for the first level of verification: I started a 3-node cluster with this 
system property and used a standard client to force the re-election (by 
creating znodes, for example). I can then see that the real servers operate 
properly and handle this case, without waiting for a month of writes to go 
through the system. Does that make more sense? (I'll update the comment.)
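
A testing hook of the kind described here might look like the sketch below. The property name and all identifiers are assumptions made up for this illustration; the comment above does not name the actual property used by the patch.

```java
// Illustrative sketch only: the property name and class are hypothetical.
final class TestingZxidOverrideSketch {
    /** Hypothetical system property, chosen for this sketch only. */
    static final String PROP = "sketch.testing.initialCounter";

    /**
     * Returns currentZxid with its lower 32 bits replaced by the override value,
     * or currentZxid unchanged when no override is given. The epoch bits are kept.
     */
    static long applyOverride(long currentZxid, String override) {
        if (override == null) {
            return currentZxid;
        }
        long counter = Long.decode(override) & 0xffffffffL; // accepts decimal or 0x-prefixed hex
        return (currentZxid & 0xffffffff00000000L) | counter;
    }

    public static void main(String[] args) {
        // e.g. java -Dsketch.testing.initialCounter=0xfffffff0 TestingZxidOverrideSketch
        long start = applyOverride(5L << 32, System.getProperty(PROP));
        System.out.println(Long.toHexString(start));
    }
}
```

Starting the counter a few writes short of the maximum lets a handful of client operations trigger the rollover path on a real cluster.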

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.6

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229708#comment-13229708
 ] 

Mahadev konar commented on ZOOKEEPER-1277:
--

Ahh... That makes more sense! Updated comments would be good. Thanks!

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.6

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229906#comment-13229906
 ] 

Hadoop QA commented on ZOOKEEPER-1277:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12518421/ZOOKEEPER-1277_trunk.patch
  against trunk revision 1297740.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//console

This message is automatically generated.

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.5, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br34.patch, 
 ZOOKEEPER-1277_trunk.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-14 Thread Patrick Hunt (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149943#comment-13149943
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

I thought about that but it seemed like a bad idea for 2 reasons I could think 
of:
1) it would cause all of the clients to disconnect and reconnect unnecessarily, 
perhaps introducing instability in the process.
2) can we guarantee that the leader will give up leadership? i.e., how would 
we effect this: exit the JVM on the leader?

In talking with Ben about it in the past (perhaps he's since changed his mind) 
he seemed to think that rolling over to a new epoch number (with no leader 
re-election) was OK.

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-14 Thread Flavio Junqueira (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149973#comment-13149973
 ] 

Flavio Junqueira commented on ZOOKEEPER-1277:
-

The scenario I have in mind to say this is incorrect is more or less the 
following:

# Leader L is currently in epoch 3 and it moves to epoch 4 in the way this 
patch proposes by simply adding 2 to hzxid. The leader proposes a transaction 
with zxid 4,1, which is acknowledged by some follower F, but not a quorum; 
# Concurrently, a new leader L' arises and selects 4 as its epoch (it hasn't 
talked to L or F);
# L' proposes a transaction with zxid 4,1, which is different from the 
transaction L proposed with the same zxid and this transaction is acknowledged 
by a quorum;
# L eventually gives up on leadership after noticing that it is not supported 
by a quorum;
# L' crashes;
# A new leader arises and its highest zxid is 4,1. It doesn't have to 
synchronize with any of the followers because they all have highest zxid 4,1. 
We have servers that have different transaction values for the same zxid, which 
constitutes an inconsistent state. 
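
The end state of the scenario above can be illustrated with a small sketch. It models each server's log as a map from zxid to a transaction payload; the class, the payload strings and the comparison helpers are hypothetical and only mimic the "highest zxid" comparison mentioned in the last step.

```java
import java.util.TreeMap;

// Illustrative sketch only: a toy model of two servers' logs, not ZooKeeper code.
final class DuplicateZxidSketch {
    /** Packs (epoch, counter) into a 64-bit zxid. */
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Log of follower F, which acknowledged L's proposal for zxid (4,1). */
    static TreeMap<Long, String> followerLog() {
        TreeMap<Long, String> log = new TreeMap<>();
        log.put(zxid(3, 7), "shared history");
        log.put(zxid(4, 1), "txn proposed by L");
        return log;
    }

    /** Log of a server in the quorum that acknowledged L' for zxid (4,1). */
    static TreeMap<Long, String> quorumLog() {
        TreeMap<Long, String> log = new TreeMap<>();
        log.put(zxid(3, 7), "shared history");
        log.put(zxid(4, 1), "txn proposed by L'");
        return log;
    }

    /** A comparison that looks only at the highest zxid, as in the final step of the scenario. */
    static boolean looksInSync(TreeMap<Long, String> a, TreeMap<Long, String> b) {
        return a.lastKey().equals(b.lastKey());
    }

    /** True only if both logs hold the same transaction for every zxid. */
    static boolean actuallyConsistent(TreeMap<Long, String> a, TreeMap<Long, String> b) {
        return a.equals(b);
    }
}
```

In this toy model the two logs pass the highest-zxid comparison yet hold different transactions for zxid (4,1), which is the inconsistency the scenario describes.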


 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-14 Thread Patrick Hunt (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149982#comment-13149982
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

I see. Yes, that would be bad. I'll try reworking the patch to drop leadership. 
Any suggestions on where to look to make that happen?

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-14 Thread Flavio Junqueira (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150047#comment-13150047
 ] 

Flavio Junqueira commented on ZOOKEEPER-1277:
-

For a quorum setup, it sounds like a good place would be in 
ProposalRequestProcessor.proposeRequest(). For standalone, it sounds like we 
should be doing something along the lines of what you proposed in your patch. 

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-14 Thread Patrick Hunt (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150066#comment-13150066
 ] 

Patrick Hunt commented on ZOOKEEPER-1277:
-

I'll rework the patch and get back. Thanks for the feedback Flavio.


 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch






[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2011-11-11 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148924#comment-13148924
 ] 

Hadoop QA commented on ZOOKEEPER-1277:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12503459/ZOOKEEPER-1277_br33.patch
  against trunk revision 1201045.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//console

This message is automatically generated.

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1277_br33.patch

