[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2019-01-24 Thread Ian Spence (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751546#comment-16751546
 ] 

Ian Spence commented on ZOOKEEPER-2164:
---

We can reproduce this issue with 3.4.6.

We have a 5-node ZK cluster. We restarted one node, and after an hour it still 
had not rejoined the quorum.

stat and mntr show "This ZooKeeper instance is not currently serving requests".
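
For readers who want to script this check, here is a minimal sketch of sending a four-letter command such as stat to a server and testing for the message quoted above. The class name, method names, and the timeout parameter are hypothetical, and the host and port in the usage note are assumptions (2181 is only the conventional client port). It is an illustration, not the tooling used in this report.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ZooKeeper.
class FourLetterWordSketch {
    // Message quoted in the comment above.
    static final String NOT_SERVING =
            "This ZooKeeper instance is not currently serving requests";

    // Sends a four-letter command (for example "stat" or "mntr") and returns
    // the raw reply. Relies on the server closing the connection after it
    // replies. InputStream.readAllBytes() needs Java 9 or later.
    static String send(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return new String(socket.getInputStream().readAllBytes(),
                    StandardCharsets.US_ASCII);
        }
    }

    // True only if the reply is non-empty and does not carry the
    // "not currently serving requests" message.
    static boolean isServing(String response) {
        return !response.isEmpty() && !response.contains(NOT_SERVING);
    }
}
```

A caller could combine the two as `FourLetterWordSketch.isServing(FourLetterWordSketch.send("localhost", 2181, "stat", 3000))`, with host and port replaced by the real values.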

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> time out for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.
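
The timing described in the report can be modelled roughly as follows. This is a deliberately simplified sketch, not ZooKeeper code: the class and method names are hypothetical, and the follower's retry count (5) and sleep between attempts (1000 ms) are assumptions chosen for illustration. Only the 5-second connect timeout comes from the description. The model asks whether the elected leader, stuck in one blocking connect to the dead server, becomes ready before the follower makes its last attempt.

```java
// Hypothetical timing model, not ZooKeeper code.
class ElectionRaceSketch {
    // Time of the follower's last connection attempt, assuming attempts at
    // 0, sleepMs, 2 * sleepMs, ... and no further attempt after the last one.
    static long followerLastAttemptMs(int tries, long sleepMs) {
        return (long) (tries - 1) * sleepMs;
    }

    // Time at which the elected leader gets around to processing votes,
    // assuming it is blocked for one full connect timeout on the dead peer.
    static long leaderReadyMs(long cnxTimeoutMs) {
        return cnxTimeoutMs;
    }

    // True if the leader is ready no later than the follower's last attempt.
    static boolean followerReachesLeader(int tries, long sleepMs, long cnxTimeoutMs) {
        return leaderReadyMs(cnxTimeoutMs) <= followerLastAttemptMs(tries, sleepMs);
    }

    public static void main(String[] args) {
        // Assumed follower behaviour: 5 tries, 1000 ms apart.
        System.out.println(followerReachesLeader(5, 1000, 5000)); // false
        System.out.println(followerReachesLeader(5, 1000, 500));  // true
    }
}
```

Under these assumptions the follower gives up about a second before the leader is ready, which matches the loop of repeated elections described above. A shorter connect timeout flips the result in this model, but the model ignores everything else that happens during an election.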



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695360#comment-16695360
 ] 

Michael K. Edwards commented on ZOOKEEPER-2164:
---

Is this reproducible with the current branch-3.5 code?



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2017-10-15 Thread Laurie Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205360#comment-16205360
 ] 

Laurie Turner commented on ZOOKEEPER-2164:
--

I believe I have run into this issue (ZooKeeper versions 3.4.6 and 3.4.10).

The scenarios I've tested lead me to believe I have the same problem. I have a 
3-node cluster, and if the leader is "2" and is stopped, the election fails and 
ultimately 1 and 3 respond to the stat command with "This ZooKeeper instance is 
not currently serving requests".

If 2 is restarted, the cluster recovers and 2 becomes the leader. This appears 
to be the scenario documented above. Sometimes 3 fails to rejoin, but if it is 
restarted it rejoins the cluster successfully.

Essentially, the only electable leader is #2. The nodes are built as Docker 
containers and orchestrated using Kubernetes.

I am searching for a workaround or configuration change that will keep the 
cluster functional if the existing leader fails and only 2 nodes (out of 3) 
are available.

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
> Fix For: 3.5.4, 3.6.0
>
>


[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-06-29 Thread BASANT KUMAR (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605352#comment-14605352
 ] 

BASANT KUMAR commented on ZOOKEEPER-2164:
-

It's on my plan to have a patch for this. I'm currently involved in internal 
stuff.

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Hongchao Deng
> Fix For: 3.5.2, 3.6.0
>



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-06-29 Thread BASANT KUMAR (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605979#comment-14605979
 ] 

BASANT KUMAR commented on ZOOKEEPER-2164:
-

ZOOKEEPER-2164: fast leader election keeps failing.



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-06-26 Thread Hongchao Deng (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603285#comment-14603285
 ] 

Hongchao Deng commented on ZOOKEEPER-2164:
--

It's on my plan to have a patch for this. I'm currently involved in internal 
stuff. I should be able to get onto this after that.

In the meantime, it sounds like you have a good testing plan. It would be nice 
if you could share it. :)



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-06-26 Thread Filip Deleersnijder (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602876#comment-14602876
 ] 

Filip Deleersnijder commented on ZOOKEEPER-2164:


We experienced a related problem.  

In a test setup with 6 servers (3.4.6) with 2 servers shut down, leader 
election could take a very long time (1 to 2 minutes) to complete. Once we 
changed the cnxTO variable in QuorumCnxManager from 5000 ms to 500 ms, it 
completed in under 10 seconds again.

In a setup with 8 servers (3.4.6) with 2 servers shut down, leader election 
could take a very long time (we have seen more than 10 minutes!) to complete, 
and it frequently started again immediately after completing.
On Monday we will test our cnxTO fix on this setup as well.
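
To give a feel for why the timeout value matters, here is a small worked calculation. It assumes that connection attempts to peers are made one after another and that each attempt to a server that is shut down blocks for the full cnxTO. That is a simplification, and the class and method names are hypothetical. The figures are arithmetic, not measurements from the setups above.

```java
// Hypothetical back-of-the-envelope model, not ZooKeeper code.
class CnxTimeoutSketch {
    // Worst-case time that one pass over all peers spends blocked on
    // unreachable servers, assuming sequential blocking connects.
    static long blockedMsPerPass(int unreachablePeers, long cnxTimeoutMs) {
        return unreachablePeers * cnxTimeoutMs;
    }

    public static void main(String[] args) {
        // Two servers shut down, as in both setups described above.
        System.out.println(blockedMsPerPass(2, 5000)); // 10000
        System.out.println(blockedMsPerPass(2, 500));  // 1000
    }
}
```

With two servers down, each pass can block for up to 10 seconds at 5000 ms but only 1 second at 500 ms. How many passes an election needs is not modelled here.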




[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-04-15 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495709#comment-14495709
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-2164:


Sure, sounds good. Thank you for driving this, Hongchao!

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-04-14 Thread Hongchao Deng (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494405#comment-14494405
 ] 

Hongchao Deng commented on ZOOKEEPER-2164:
--

Good catch!

Actually I'm trying to refactor this part to be non-blocking :) More like:

1. Construct a connector instead of calling connectOne().
2. Submit the connector to the connection manager. submit() returns a Future.
3. The connector is an interface, so a Netty-based implementation can be rolled in.

I would like to take advantage of this JIRA and discuss the design here. Any 
thoughts?
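
As a starting point for that discussion, the three steps could look roughly like the sketch below. It is not a patch: Connector, ConnectionManagerSketch, and await are hypothetical names, the thread pool is an arbitrary choice, and nothing here is taken from QuorumCnxManager. It only shows the shape of "submit a connector, get a Future back" using java.util.concurrent.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the proposed connector. One implementation
// could wrap a plain socket, another could wrap Netty.
interface Connector {
    boolean connect() throws Exception;
}

// Hypothetical connection manager: submit() returns at once, so the
// caller is never blocked on a slow or dead peer.
class ConnectionManagerSketch implements AutoCloseable {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    Future<Boolean> submit(Connector connector) {
        Callable<Boolean> task = connector::connect;
        return pool.submit(task);
    }

    // Waits up to timeoutMs for the attempt. Returns false and cancels
    // the attempt if it failed or has not finished in time.
    static boolean await(Future<Boolean> pending, long timeoutMs) {
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            pending.cancel(true);
            return false;
        } catch (Exception e) {
            pending.cancel(true);
            return false;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow(); // interrupts attempts still in flight
    }
}
```

The election thread would call submit() for each peer and carry on processing votes, checking the returned Futures only when it needs to know whether a connection is up.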



[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2015-04-14 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494154#comment-14494154
 ] 

Flavio Junqueira commented on ZOOKEEPER-2164:
-

connectOne is blocking, yes? Shall we make it non-blocking?
