[ https://issues.apache.org/jira/browse/ZOOKEEPER-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819885#comment-17819885 ]
fanyang commented on ZOOKEEPER-4771: ------------------------------------ Maybe ZOOKEEPER-1748 will help > Fast leader election taking too long > ------------------------------------ > > Key: ZOOKEEPER-4771 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection > Affects Versions: 3.4.10 > Reporter: Ivo Vrdoljak > Priority: Blocker > Attachments: zookeeper10.conf, zookeeper10.log, zookeeper11.log, > zookeeper12.log, zookeeper20.log, zookeeper21.log > > > Hello ZooKeeper Community, > > Background: > We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper > servers running, that are distributed across 2 clusters of servers. > In the first cluster, we have 3 Zookeeper servers, each deployed on its own > machine, and in the second cluster we have 2 Zookeeper servers, also each on > its own machine. Zookeeper servers that are distributed on the same cluster > communicate through the local network, and with the servers on the remote > cluster through an external network. > The situation is the following: > > {code:java} > Cluster 1 > Zookeeper server 10 > Zookeeper server 11 > Zookeeper server 12 -> Leader > Cluster 2 > Zookeeper server 20 > Zookeeper server 21 > {code} > > Problem: > We have an issue with Fast Leader Election when we kill the ZooKeeper leader > process. > After the leader (server 12) is killed and leader election starts, we can see > in Zookeeper logs that voting notifications are exchanged from each Zookeeper > server that remained alive towards all the others. Notification on Zookeeper > servers located in the same cluster (communicating over the local network) > are successfully exchanged. The problem seems to be for Zookeeper server > sending votes over external network as according to the logs they are only > sent in one direction. > > Logs from zookeeper server 10: > {code:java} > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO LOOKING > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election > protocol... > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), > 0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO New election. My id = 10, > proposed zxid=0xe00009c97 > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed > leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed > leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11{code} > Logs from zookeeper server 20: > {code:java} > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO LOOKING > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election > protocol... > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed > leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed > leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed > leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed > leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), > 0xe (n.peerEpoch) > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed > leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed > leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11 > Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed > leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11{code} > These are just some of the lines from the log relevant to the issue and > complete logs from all the servers are attached as well as the server > configuration. > As it can be seen from the logs the server with ID 20 is sending > notifications to all cluster members but in the logs from server 10 there is > no indication that a notification from server 20 is received. However, it is > obvious from the logs that server 20 received notifications from server 10 at > the same time. This part is a bit strange since the same socket is used for > communication in both directions. > The first suspect was the network, but after investigation by following > tcpdump we could not find any packages leaving from the cluster where the Zk > server 20 is located which is sending notifications. > About almost half a minute later, the server realizes that there is no > connection to Zk servers 10 and 11 and then reopens channels towards them. > After that, the notifications are exchanged and the quorum is formed quickly > after: > {code:java} > Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN Connection broken for id 11, my id > = 20, error = #012java.net.SocketException: Connection timed out (Read > failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) > ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown > Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown > Source) ~[?:1.8.0_261]#012#011at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7] > Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN Connection broken for id 10, my id > = 20, error = #012java.net.SocketException: Connection timed out (Read > failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) > ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown > Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown > Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown > Source) ~[?:1.8.0_261]#012#011at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7] > ..... > Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO FOLLOWING > Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO Created server with tickTime 1500 > minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 > snapdir /local/version-2 > Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO FOLLOWING - LEADER ELECTION TOOK - > 27333{code} > We have two main questions related to this issue if you could help: > 1) Is there any way to know for sure that the vote notification was actually > sent other than the following log: > {code:java} > Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 > (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), > 0xe (n.peerEpoch){code} > Or is there any chance that the receiving server would discard the > notification for some reason and not even log the event, since we cannot see > anything in the logs even with debug enabled? > 2) Are there any options to reduce this timeout since half a minute for > leader election is too much for our highly available product with only 5 > zookeeper servers? Is there any configuration parameter that could be tuned? > Also, do you know if the same or similar issue has been already reported or > fixed in some of the later versions of the product? We could not find > something similar in Zookeeper Jira. > > Thanks and regards, > Ivo > -- This message was sent by Atlassian Jira (v8.20.10#820010)