[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819885#comment-17819885
 ] 

fanyang commented on ZOOKEEPER-4771:
------------------------------------

Maybe ZOOKEEPER-1748 will help

> Fast leader election taking too long
> ------------------------------------
>
>                 Key: ZOOKEEPER-4771
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.10
>            Reporter: Ivo Vrdoljak
>            Priority: Blocker
>         Attachments: zookeeper10.conf, zookeeper10.log, zookeeper11.log, 
> zookeeper12.log, zookeeper20.log, zookeeper21.log
>
>
> Hello ZooKeeper Community,
>  
> Background:
> We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper 
> servers running, that are distributed across 2 clusters of servers.
> In the first cluster, we have 3 Zookeeper servers, each deployed on its own 
> machine, and in the second cluster we have 2 Zookeeper servers, also each on 
> its own machine. Zookeeper servers that are distributed on the same cluster 
> communicate through the local network, and with the servers on the remote 
> cluster through an external network.
> The situation is the following:
>  
> {code:java}
> Cluster 1
> Zookeeper server 10
> Zookeeper server 11
> Zookeeper server 12 -> Leader
> Cluster 2
> Zookeeper server 20
> Zookeeper server 21
> {code}
>  
> Problem:
> We have an issue with Fast Leader Election when we kill the ZooKeeper leader 
> process.
> After the leader (server 12) is killed and leader election starts, we can see 
> in Zookeeper logs that voting notifications are exchanged from each Zookeeper 
> server that remained alive towards all the others. Notification on Zookeeper 
> servers located in the same cluster (communicating over the local network) 
> are successfully exchanged. The problem seems to be for Zookeeper server 
> sending votes over external network as according to the logs they are only 
> sent in one direction.
>  
> Logs from zookeeper server 10:
> {code:java}
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  LOOKING
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election 
> protocol...
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 
> 0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  New election. My id =  10, 
> proposed zxid=0xe00009c97
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
> leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
> leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
> Logs from zookeeper server 20:
> {code:java}
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO  LOOKING
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election 
> protocol...
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed 
> leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed 
> leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed 
> leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
> leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 
> 0xe (n.peerEpoch)
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed 
> leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
> leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
> Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
> leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
> These are just some of the lines from the log relevant to the issue and 
> complete logs from all the servers are attached as well as the server 
> configuration.
> As it can be seen from the logs the server with ID 20 is sending 
> notifications to all cluster members but in the logs from server 10 there is 
> no indication that a notification from server 20 is received. However, it is 
> obvious from the logs that server 20 received notifications from server 10 at 
> the same time. This part is a bit strange since the same socket is used for 
> communication in both directions.
> The first suspect was the network, but after investigation by following 
> tcpdump we could not find any packages leaving from the cluster where the Zk 
> server 20 is located which is sending notifications.
> About almost half a minute later, the server realizes that there is no 
> connection to Zk servers 10 and 11 and then reopens channels towards them. 
> After that, the notifications are exchanged and the quorum is formed quickly 
> after:
> {code:java}
> Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 11, my id 
> = 20, error = #012java.net.SocketException: Connection timed out (Read 
> failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) 
> ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown 
> Source) ~[?:1.8.0_261]#012#011at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
>  [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
> Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 10, my id 
> = 20, error = #012java.net.SocketException: Connection timed out (Read 
> failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) 
> ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown 
> Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown 
> Source) ~[?:1.8.0_261]#012#011at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
>  [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
> .....
> Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING
> Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  Created server with tickTime 1500 
> minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 
> snapdir /local/version-2
> Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING - LEADER ELECTION TOOK - 
> 27333{code}
> We have two main questions related to this issue if you could help:
> 1) Is there any way to know for sure that the vote notification was actually 
> sent other than the following log:
> {code:java}
> Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 
> (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 
> 0xe (n.peerEpoch){code}
> Or is there any chance that the receiving server would discard the 
> notification for some reason and not even log the event, since we cannot see 
> anything in the logs even with debug enabled?
> 2) Are there any options to reduce this timeout since half a minute for 
> leader election is too much for our highly available product with only 5 
> zookeeper servers? Is there any configuration parameter that could be tuned?
> Also, do you know if the same or similar issue has been already reported or 
> fixed in some of the later versions of the product? We could not find 
> something similar in Zookeeper Jira.
>  
> Thanks and regards,
> Ivo
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to