[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060200#comment-17060200 ]

Mate Szalay-Beko commented on ZOOKEEPER-3756:
---------------------------------------------

OK, I have a theory... Maybe this is what happens:
- After shutting down the leader, the whole leader election restarts.
- ZooKeeper opens socket connections to the other ZooKeeper servers from 
synchronized methods, so only one connection attempt can run at a time; see the 
sketch right after this list, and on the master branch: 
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L688
 and 
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L759
- The default connection timeout is 5 seconds (this is why there are no leader 
election related messages in your log files for 5 seconds, until we hit the 
timeout of the socket open to server 3).
- By the time the 5 second timeout elapsed, the leader election round had also 
timed out (although AFAIK it always increases its internal timeout? I will need 
to verify this).
- After this happens a few times, either the leader election timeout grows 
large enough to tolerate the 5 second delay, and/or server 3 comes back up so 
the socket can be opened again; then the blocking disappears and everything 
goes smoothly. But it took 30 seconds, which is way too long...
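
Just to illustrate the serialization I mean above (this is NOT the actual 
QuorumCnxManager code, only a minimal standalone sketch with made-up names and 
addresses):
{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;

// minimal sketch: a synchronized connect method means one unreachable peer
// blocks every other election connection attempt for the full timeout
public class SerializedConnectSketch {

    private static final int CNX_TIMEOUT_MS = 5000; // default zookeeper.cnxTimeout

    // only one connection attempt can run at a time (like the synchronized connectOne)
    static synchronized void connectOne(InetSocketAddress electionAddr) {
        long start = System.currentTimeMillis();
        try (Socket sock = new Socket()) {
            sock.connect(electionAddr, CNX_TIMEOUT_MS); // may block for up to 5 seconds
            System.out.println("connected to " + electionAddr);
        } catch (Exception e) {
            System.out.println("failed to connect to " + electionAddr + " after "
                    + (System.currentTimeMillis() - start) + " ms: " + e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // two threads try to connect in parallel, but the synchronized method
        // forces them to run one after the other: if the first peer is
        // unreachable, the second attempt cannot even start for 5 seconds
        Thread t1 = new Thread(() -> connectOne(new InetSocketAddress("100.71.255.252", 3888)));
        Thread t2 = new Thread(() -> connectOne(new InetSocketAddress("100.71.255.251", 3888)));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
{code}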

The question is why the socket open needs to run into the timeout (waiting for 
5 seconds) and why the connection doesn't fail immediately with some 'host 
unreachable' exception, which is what we would expect when the server goes down 
and no IP connection can be established. Usually we don't see this problem in 
production, so I guess it has something to do with Kubernetes networking.

Still, this part needs to be refactored in ZooKeeper: we have to make 
{{connectOne}} asynchronous, which is not an easy task. This is also something 
that was suggested in ZOOKEEPER-2164 (although in that ticket other errors were 
fixed in the end). A very rough sketch of the idea is below.
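
This is only a sketch of the direction with hypothetical names, not a patch:
{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// rough sketch: hand each connection attempt to a small thread pool, so the
// thread driving the leader election never blocks on one unreachable peer
public class AsyncConnectSketch {

    private static final int CNX_TIMEOUT_MS = 5000;
    private final ExecutorService connectionExecutor = Executors.newFixedThreadPool(4);

    // returns immediately; the blocking connect runs on a pool thread
    void connectOneAsync(long sid, InetSocketAddress electionAddr) {
        connectionExecutor.submit(() -> {
            try (Socket sock = new Socket()) {
                sock.connect(electionAddr, CNX_TIMEOUT_MS);
                // in the real code this is roughly where the handshake would run
                // and the SendWorker/RecvWorker threads would be started
                System.out.println("connected to server." + sid);
            } catch (Exception e) {
                System.out.println("cannot open connection to server." + sid + ": " + e);
            }
        });
    }
}
{code}
The hard part is of course not this, but keeping the rest of QuorumCnxManager 
(handshakes, duplicate connections, shutdown) correct when connections are 
established concurrently.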

In the meanwhile there might be some workarounds:
# You can decrease the connection timeout to e.g. 500 ms or 1000 ms using the 
{{-Dzookeeper.cnxTimeout=500}} system property. I am not sure if it will help, 
but I would be glad if you could test it.
# Another, independent workaround would be to use the multiAddress feature of 
ZooKeeper 3.6.0, enabled by {{-Dzookeeper.multiAddress.enabled=true}} (see the 
config sketch after this list). ZooKeeper should then periodically check the 
availability of the currently used election addresses and kill the socket if 
the host is unreachable. This way we might kill the dead socket before the 
timeout happens. However, it might run ICMP traffic (ping) in the background, 
and I am not sure how reliable that will be in Kubernetes.
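
For example, keeping the config from this ticket (just a sketch, please 
double-check the exact syntax; the flag values and the secondary 10.0.0.x 
addresses below are made up):
{code}
# JVM flags (e.g. via SERVER_JVMFLAGS), values here are only examples:
#   -Dzookeeper.multiAddress.enabled=true
#   -Dzookeeper.cnxTimeout=1000

server.1=100.71.255.254:2888:3888:participant;2181
server.2=100.71.255.253:2888:3888:participant;2181
server.3=100.71.255.252:2888:3888:participant;2181
server.4=100.71.255.251:2888:3888:participant;2181
server.5=100.71.255.250:2888:3888:participant;2181

# with multiAddress enabled you can also list several election addresses per
# server, separated by '|', if the servers really have multiple networks, e.g.:
# server.1=100.71.255.254:2888:3888|10.0.0.1:2888:3888;2181
{code}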

No matter whether the workarounds fix the problem for you or not, I would 
suggest keeping this ticket open, and I will try to implement an asynchronous 
connection establishment somehow.

> Members failing to rejoin quorum
> --------------------------------
>
>                 Key: ZOOKEEPER-3756
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: leaderElection
>    Affects Versions: 3.5.6, 3.5.7
>            Reporter: Dai Shi
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the 
> leader.
> However, if I remove server 4 and 5 from the configuration of server 1 or 2 
> (so only servers 1, 2, and 3 remain in the configuration file), then they can 
> rejoin the quorum fine. Is this expected and am I doing something wrong? Any 
> help or explanation would be greatly appreciated. Thank you.


