[
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060200#comment-17060200
]
Mate Szalay-Beko commented on ZOOKEEPER-3756:
---------------------------------------------
OK, I have a theory... Maybe this is what happens:
- After shutting down the leader, the whole leader election restarts
- ZooKeeper tries to open socket connections to the other ZooKeeper servers
using synchronized methods, so only one can run at a time (see the sketch
after this list, and on the master branch:
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L688
and
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L759)
- the default timeout is 5 secs (this is why there are no leader election
related log messages in your log files for 5 seconds, until we hit the timeout
of the socket open to server 3)
- by the time the 5 sec timeout elapsed, the leader election protocol had also
timed out (but AFAIK it always increases its internal timeout? I will need to
verify this)
- after this happens a few times, either the leader election protocol timeout
is increased enough to tolerate the 5 sec delay, and/or server 3 restarts and
the socket can be opened again; either way the blocking goes away and
everything goes smoothly after that. But it took 30 seconds, which is way too
long...
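To illustrate the blocking I mean, here is a minimal sketch of the pattern
(not the real {{QuorumCnxManager}} code; the class and constant names are just
for illustration): a synchronized {{connectOne}} plus a blocking socket
connect means a single unreachable peer stalls every other election connection
for the full timeout.
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ConnectOneSketch {
    // assumption: 5000 ms mirrors the default zookeeper.cnxTimeout
    private static final int CNX_TIMEOUT_MS = 5000;

    // synchronized: while one call sits in sock.connect() waiting out the
    // timeout, every other election connection attempt is blocked as well
    synchronized void connectOne(long sid, InetSocketAddress electionAddr) {
        Socket sock = new Socket();
        try {
            // blocks for up to CNX_TIMEOUT_MS if the peer address is unreachable
            sock.connect(electionAddr, CNX_TIMEOUT_MS);
            // the real code would do the election handshake and start the
            // SendWorker / RecvWorker threads here
        } catch (IOException e) {
            // when packets are silently dropped (no RST, no ICMP reply) we only
            // get here after the full timeout, not immediately
            try {
                sock.close();
            } catch (IOException ignore) {
                // best effort cleanup
            }
        }
    }
}
{code}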
The question is why the socket needs to time out (wait for 5 sec) and why the
connection doesn't get closed immediately with some 'host unreachable'
exception, which is what we would expect when the server goes down and no IP
connection can be established. Usually we don't see this problem in
production, so I guess it has something to do with Kubernetes networking.
Still, this part needs to be refactored in ZooKeeper: we have to make
{{connectOne}} asynchronous, which is not an easy task. Actually this is also
something that was suggested in ZOOKEEPER-2164 (but in that ticket other
errors were fixed in the end).
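Just to show the direction I am thinking of, a rough sketch only (not a patch;
the executor-based approach and all the names here are my assumption of how it
could look):
{code:java}
import java.net.InetSocketAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncConnectSketch {
    // assumption: pool sizing and lifecycle would need proper design work
    // inside QuorumCnxManager
    private final ExecutorService connectExecutor = Executors.newCachedThreadPool();

    void connectOneAsync(long sid, InetSocketAddress electionAddr) {
        // hand the potentially blocking connect off to a worker thread, so the
        // caller can keep exchanging notifications with the reachable peers
        connectExecutor.submit(() -> connectOneBlocking(sid, electionAddr));
    }

    private void connectOneBlocking(long sid, InetSocketAddress electionAddr) {
        // the existing synchronized / blocking connect logic would live here
    }
}
{code}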
In the meantime there might be some workarounds:
# you can decrease the connection timeout to e.g. 500ms or 1000ms using the
{{-Dzookeeper.cnxTimeout=500}} system property (see the example after this
list). I am not sure if it will help, but I would be glad if you could test it
# another, independent workaround would be to use the multiAddress feature of
ZooKeeper 3.6.0, enabling it with {{-Dzookeeper.multiAddress.enabled=true}}.
Then ZooKeeper should periodically check the availability of the currently
used election addresses and kill the socket if the host is unavailable. This
way we might kill the dead socket before the timeout happens. However, it
might run ICMP traffic (ping) in the background, which I am not sure will be
reliable in Kubernetes.
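For completeness, this is how I would wire both flags into the JVM in a setup
like yours (just an illustration, assuming your entrypoint passes the standard
{{SERVER_JVMFLAGS}} variable through to {{zkServer.sh}} / {{zkEnv.sh}}; adjust
it to how your image actually starts ZooKeeper):
{code}
export SERVER_JVMFLAGS="-Dzookeeper.cnxTimeout=500 -Dzookeeper.multiAddress.enabled=true"
{code}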
Whether or not the workarounds fix the problem for you, I would suggest
keeping this ticket open, and I will try to implement an asynchronous
connection establishment somehow.
> Members failing to rejoin quorum
> --------------------------------
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection
> Affects Versions: 3.5.6, 3.5.7
> Reporter: Dai Shi
> Assignee: Mate Szalay-Beko
> Priority: Major
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh,
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in
> their configuration:
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] -
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
> - New election. My id = 2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier,
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier,
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier,
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the
> leader.
> However, if I remove servers 4 and 5 from the configuration of server 1 or 2
> (so only servers 1, 2, and 3 remain in the configuration file), then they can
> rejoin the quorum fine. Is this expected, and am I doing something wrong? Any
> help or explanation would be greatly appreciated. Thank you.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)