anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540011376

I uploaded the logs of the failing Follower here: https://pastebin.com/LsXYiRKt

It was running on a Mac and the situation was as previously described:
1. 2 interfaces were running: wifi and cable,
2. cable was unplugged,
3. wifi got disabled, cable plugged in

After the 3rd step we had to wait approximately 1 minute for the quorum to come up again. We believe this happened because at the first exception:
```
2019-10-09 13:49:43,744 [myid:1] - WARN [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@127] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
```
the Follower shuts down and restarts the leader election, but `QuorumCnxManager` still believes the connections are up. After a minute it finally gets a SocketException here:
```
2019-10-09 13:50:37,709 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@1336] - Connection broken for id 3, my id = 1, error =
java.net.SocketException: Operation timed out (Read failed)
```
and shuts down all Send/Recv workers. It takes this long because the read timeout on that socket is infinite, to prevent the leader election port from shutting down when no traffic is transmitted. By this point the leader election has raised the notification timeout to approx. 1 minute, so we have to wait quite long for notifications to be resent.

If only a single node is failing, the quorum is still up, so I believe it's not a big deal. But if we think about an entire switch failure, which could shut down the entire ensemble at the same time, this could be too long to recover.
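
To illustrate the read-timeout behaviour described above, here is a minimal standalone sketch. It is not ZooKeeper code: the class `ReadTimeoutSketch` and the method `readTimesOut` are hypothetical names, and a silent loopback peer stands in for a quorum member whose link has gone away. With a finite `SO_TIMEOUT` the blocked read gives up with `SocketTimeoutException` after the configured interval. With a value of 0 (infinite) the same read would simply keep blocking until the OS reports an error on the connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects to a local peer that never sends anything and attempts one
     * read with the given SO_TIMEOUT (in milliseconds, must be > 0 here).
     * Returns true if the read gave up with SocketTimeoutException.
     * Hypothetical helper for illustration only.
     */
    static boolean readTimesOut(int soTimeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // A finite timeout bounds how long read() may block.
            // setSoTimeout(0) would mean "block indefinitely".
            client.setSoTimeout(soTimeoutMs);
            try {
                client.getInputStream().read(); // peer stays silent, so this blocks
                return false;                   // only reached if data or EOF arrives
            } catch (SocketTimeoutException e) {
                return true;                    // the read was abandoned after the timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean timedOut = readTimesOut(500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("timed out: " + timedOut + " after ~" + elapsedMs + " ms");
    }
}
```

The sketch deliberately never calls `readTimesOut(0)`, since that read would not return while the peer stays silent, which is the situation the RecvWorker appears to be in until the OS-level error surfaces.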
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
