[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751546#comment-16751546 ]

Ian Spence commented on ZOOKEEPER-2164:
---------------------------------------

We can reproduce this issue with 3.4.6. In a 5-node ZK cluster, we restarted one node, and after an hour it still has not rejoined the quorum. Both stat and mntr show "This ZooKeeper instance is not currently serving requests".

> fast leader election keeps failing
> ----------------------------------
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader.
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't
> timeout for 5 seconds:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a
> while, so I'm guessing later versions have the same issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
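The timing in the issue description can be illustrated with a small model. This is only a sketch, not ZooKeeper code: the class and method names are hypothetical, the 5000 ms connect timeout is taken from the description, and the follower's retry budget (5 attempts, 1000 ms apart) is an assumed value chosen for illustration. It also assumes that the elected leader works through unreachable peers one at a time and that a refused connection fails immediately.

```java
public final class ElectionRaceSketch {

    /**
     * Time (ms) until the elected leader gets around to processing votes,
     * assuming it blocks for one full connect timeout per unreachable peer.
     */
    static long leaderReadyAtMs(int unreachablePeers, long connectTimeoutMs) {
        return unreachablePeers * connectTimeoutMs;
    }

    /**
     * Time (ms) at which a follower stops retrying, assuming each refused
     * attempt fails immediately and it pauses only between attempts.
     */
    static long followerGivesUpAtMs(int attempts, long pauseMs) {
        return (attempts - 1) * pauseMs;
    }

    /** True if the follower is still retrying when the leader becomes ready. */
    static boolean followerStillWaiting(int unreachablePeers, long connectTimeoutMs,
                                        int attempts, long pauseMs) {
        return leaderReadyAtMs(unreachablePeers, connectTimeoutMs)
                <= followerGivesUpAtMs(attempts, pauseMs);
    }

    public static void main(String[] args) {
        int attempts = 5;      // assumed follower retry count (illustrative)
        long pauseMs = 1000;   // assumed pause between retries (illustrative)
        for (long timeoutMs : new long[] {5000, 500}) {
            System.out.printf("connect timeout %d ms: follower still waiting = %b%n",
                    timeoutMs, followerStillWaiting(1, timeoutMs, attempts, pauseMs));
        }
    }
}
```

Under these assumptions the follower's retry window closes before a 5000 ms blocking connect returns, which matches the loop the reporter describes; a shorter connect timeout fits inside the window.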
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695360#comment-16695360 ]

Michael K. Edwards commented on ZOOKEEPER-2164:
-----------------------------------------------

Is this reproducible with the current branch-3.5 code?
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205360#comment-16205360 ]

Laurie Turner commented on ZOOKEEPER-2164:
------------------------------------------

I believe I have run into this issue (ZooKeeper versions 3.4.6 and 3.4.10). The scenarios I've tested lead me to believe I have the same problem.

I have a 3-node cluster, and if the leader is "2" and is stopped, the election fails and ultimately 1 and 3 respond to the stat command with "This ZooKeeper instance is not currently serving requests". If 2 is restarted, the cluster returns and 2 becomes the leader. This appears to be the scenario documented above. Sometimes 3 will fail to rejoin, but if it is restarted it rejoins the cluster successfully. Essentially, the only electable leader is #2.

The nodes are built as Docker containers and orchestrated using Kubernetes.

I am searching for a workaround or configuration change that will keep the cluster functional if the existing leader fails and only 2 nodes (out of 3) are available.

> fast leader election keeps failing
> ----------------------------------
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Fix For: 3.5.4, 3.6.0
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605352#comment-14605352 ]

BASANT KUMAR commented on ZOOKEEPER-2164:
-----------------------------------------

It's on my plan to have a patch for this. I'm currently involved in internal stuff.

> fast leader election keeps failing
> ----------------------------------
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Assignee: Hongchao Deng
> Fix For: 3.5.2, 3.6.0
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605979#comment-14605979 ]

BASANT KUMAR commented on ZOOKEEPER-2164:
-----------------------------------------

ZOOKEEPER-2164: fast leader election keeps failing.
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603285#comment-14603285 ]

Hongchao Deng commented on ZOOKEEPER-2164:
------------------------------------------

It's on my plan to have a patch for this. I'm currently involved in internal stuff. I should be able to get onto this after that.

In the meantime, it sounds like you have a good testing plan. It would be nice if you could share it. :)
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602876#comment-14602876 ]

Filip Deleersnijder commented on ZOOKEEPER-2164:
------------------------------------------------

We experienced a related problem.

In a test setup with 6 servers (3.4.6) and 2 servers shut down, leader election could take a very long time (1 to 2 minutes) to complete. Once we changed the cnxTO variable in QuorumCnxManager from 5000 ms to 500 ms, it completed in under 10 seconds again.

In a setup with 8 servers (3.4.6) and 2 servers shut down, leader election could take a very long time to complete (we have experienced more than 10 minutes!) and frequently started again immediately after completing. On Monday we will test our cnxTO fix on this setup as well.
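The effect of shortening the connect timeout can be tried outside ZooKeeper with a plain socket. The following is a minimal sketch, not the QuorumCnxManager code: the class name, the helper names, and the system property `sketch.cnxTimeoutMs` are all hypothetical, and only the 5000 ms default and the 500 ms alternative come from the comment above. It relies on the standard `Socket.connect(SocketAddress, int)` overload, which bounds how long a connection attempt may block.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public final class ConnectTimeoutSketch {
    // Hypothetical property name, used only for this illustration.
    static final String TIMEOUT_PROPERTY = "sketch.cnxTimeoutMs";
    // Default mirrors the 5000 ms value mentioned in the comment above.
    static final int DEFAULT_TIMEOUT_MS = 5000;

    /** Parses a timeout in ms; falls back if the value is missing, malformed or not positive. */
    static int resolveTimeoutMs(String raw, int fallbackMs) {
        if (raw == null) {
            return fallbackMs;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : fallbackMs;
        } catch (NumberFormatException e) {
            return fallbackMs;
        }
    }

    /** Attempts one TCP connection, blocking for at most timeoutMs; returns true on success. */
    static boolean tryConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections as well as SocketTimeoutException.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: ConnectTimeoutSketch <host> <port>");
            return;
        }
        int timeoutMs = resolveTimeoutMs(System.getProperty(TIMEOUT_PROPERTY), DEFAULT_TIMEOUT_MS);
        long start = System.nanoTime();
        boolean connected = tryConnect(args[0], Integer.parseInt(args[1]), timeoutMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("connected=" + connected + " after " + elapsedMs
                + " ms (timeout " + timeoutMs + " ms)");
    }
}
```

Running it against an address that silently drops packets, once with the default and once with `-Dsketch.cnxTimeoutMs=500`, shows how long a single blocked attempt can hold up the calling thread in each case.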
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495709#comment-14495709 ]

Michi Mutsuzaki commented on ZOOKEEPER-2164:
--------------------------------------------

Sure, sounds good. Thank you for driving this, Hongchao!
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494405#comment-14494405 ]

Hongchao Deng commented on ZOOKEEPER-2164:
------------------------------------------

Good catch! Actually, I'm trying to refactor this part to be non-blocking :)
More like:
1. Construct a connector instead of calling connectOne().
2. Submit the connector to the connection manager. Submit() returns a Future.
3. The connector is an interface for Netty to roll in.

I would like to take advantage of this JIRA and discuss the design here. Any thoughts?
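The three steps proposed above could look roughly like the following. This is a sketch of the idea only, not the design that was eventually implemented: `Connector`, `ConnectionManager` and `connectedWithin` are hypothetical names, a cached thread pool stands in for whatever transport (for example Netty) would sit behind the interface, and the "dead peer" is simulated with a sleep.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class ConnectorSketch {

    /** Hypothetical connector: one connection attempt that yields true on success. */
    interface Connector extends Callable<Boolean> {}

    /** Hypothetical connection manager: runs connectors off the caller's thread. */
    static final class ConnectionManager implements AutoCloseable {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        /** Returns immediately; the attempt itself runs on a pool thread. */
        Future<Boolean> submit(Connector connector) {
            return pool.submit(connector);
        }

        @Override
        public void close() {
            pool.shutdownNow();
        }
    }

    /** Waits at most timeoutMs for an attempt; cancels it and returns false otherwise. */
    static boolean connectedWithin(Future<Boolean> attempt, long timeoutMs) {
        try {
            return Boolean.TRUE.equals(attempt.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            attempt.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            attempt.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) {
        try (ConnectionManager manager = new ConnectionManager()) {
            // Simulated unreachable peer: the attempt hangs for 5 seconds.
            Future<Boolean> dead = manager.submit(() -> { Thread.sleep(5000); return true; });
            // Simulated healthy peer: the attempt succeeds at once.
            Future<Boolean> live = manager.submit(() -> true);
            // The slow attempt no longer delays work on the healthy peer.
            System.out.println("live peer connected: " + connectedWithin(live, 1000));
            System.out.println("dead peer connected: " + connectedWithin(dead, 100));
        }
    }
}
```

The point of returning a Future is that the caller decides how long each peer is worth waiting for, so one unreachable server cannot stall vote processing for the others.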
[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494154#comment-14494154 ]

Flavio Junqueira commented on ZOOKEEPER-2164:
---------------------------------------------

connectOne is blocking, yes? Shall we make it non-blocking?