Ivo Vrdoljak created ZOOKEEPER-4771:
---------------------------------------

             Summary: Fast leader election taking too long
                 Key: ZOOKEEPER-4771
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.4.10
            Reporter: Ivo Vrdoljak
         Attachments: zookeeper10.log, zookeeper11.log, zookeeper12.log, 
zookeeper20.log, zookeeper21.log

Hello ZooKeeper Community,
 
Background:
We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper 
servers running, that are distributed across 2 clusters of servers.
In the first cluster, we have 3 Zookeeper servers, each deployed on its own 
machine, and in the second cluster we have 2 Zookeeper servers, also each on 
its own machine. Zookeeper servers that are distributed on the same cluster 
communicate through the local network, and with the servers on the remote 
cluster through an external network.
The situation is the following:

 
{code:java}
Cluster 1
Zookeeper server 10
Zookeeper server 11
Zookeeper server 12 -> Leader
Cluster 2
Zookeeper server 20
Zookeeper server 21
{code}
 

Problem:
We have an issue with Fast Leader Election when we kill the ZooKeeper leader 
process.
After the leader (server 12) is killed and leader election starts, we can see 
in Zookeeper logs that voting notifications are exchanged from each Zookeeper 
server that remained alive towards all the others. Notification on Zookeeper 
servers located in the same cluster (communicating over the local network) are 
successfully exchanged. The problem seems to be for Zookeeper server sending 
votes over external network as according to the logs they are only sent in one 
direction.
 
Logs from zookeeper server 10:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election 
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 
0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  New election. My id =  10, proposed 
zxid=0xe00009c97
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
Logs from zookeeper server 20:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election 
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed 
leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed 
leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed 
leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
These are just some of the lines from the log relevant to the issue and 
complete logs from all the servers are attached as well as the server 
configuration.

As it can be seen from the logs the server with ID 20 is sending notifications 
to all cluster members but in the logs from server 10 there is no indication 
that a notification from server 20 is received. However, it is obvious from the 
logs that server 20 received notifications from server 10 at the same time. 
This part is a bit strange since the same socket is used for communication in 
both directions.
The first suspect was the network, but after investigation by following tcpdump 
we could not find any packages leaving from the cluster where the Zk server 20 
is located which is sending notifications.

About almost half a minute later, the server realizes that there is no 
connection to Zk servers 10 and 11 and then reopens channels towards them. 
After that, the notifications are exchanged and the quorum is formed quickly 
after:
{code:java}
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 11, my id = 
20, error = #012java.net.SocketException: Connection timed out (Read 
failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) 
~[?:1.8.0_261]#012#011at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
 [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 10, my id = 
20, error = #012java.net.SocketException: Connection timed out (Read 
failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) 
~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) 
~[?:1.8.0_261]#012#011at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
 [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
.....
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  Created server with tickTime 1500 
minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 snapdir 
/local/version-2
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING - LEADER ELECTION TOOK - 
27333{code}
We have two main questions related to this issue if you could help:

1) Is there any way to know for sure that the vote notification was actually 
sent other than the following log:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe 
(n.peerEpoch){code}
Or is there any chance that the receiving server would discard the notification 
for some reason and not even log the event, since we cannot see anything in the 
logs even with debug enabled?

2) Are there any options to reduce this timeout since half a minute for leader 
election is too much for our highly available product with only 5 zookeeper 
servers? Is there any configuration parameter that could be tuned?

Also, do you know if the same or similar issue has been already reported or 
fixed in some of the later versions of the product? We could not find something 
similar in Zookeeper Jira.

 
Thanks and regards,
Ivo
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to