symat opened a new pull request #1289: ZOOKEEPER-3756: Members slow to rejoin quorum using Kubernetes
URL: https://github.com/apache/zookeeper/pull/1289
 
 
   Whenever we stop the current master ZooKeeper server, a new leader election
   is triggered. During the new election, connections are established between
   all the servers by calling the synchronized 'connectOne' method in
   QuorumCnxManager. The method opens the socket and sends a single small
   initial message to the other server, which is usually very quick. If the
   destination host is unreachable, it should fail immediately.
   
   However, when we use Kubernetes, the destination host is always reachable,
   as it points to Kubernetes services. If the actual container / pod is not
   available, then the 'socket.connect' method will time out (by default after
   5 sec) instead of failing immediately with NoRouteToHostException. As the
   'connectOne' method is synchronized, this timeout blocks the creation of
   other connections, so a single unreachable host can cause timeouts in the
   leader election protocol.
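
   The blocking effect described above can be illustrated with a minimal,
   self-contained sketch. It is not the QuorumCnxManager code: the class
   name, the 'onLockHeld' callback and the simulated timeout are assumptions
   made only for this illustration, and the slow 'socket.connect' is replaced
   by a sleep.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a class with a synchronized connect method.
class SynchronizedConnectSketch {
    private final long simulatedTimeoutMs;

    SynchronizedConnectSketch(long simulatedTimeoutMs) {
        this.simulatedTimeoutMs = simulatedTimeoutMs;
    }

    // Only one caller at a time can be inside this method. When the pod is
    // not available, the sleep stands in for a connect call that times out.
    synchronized void connectOne(boolean podAvailable, Runnable onLockHeld) {
        onLockHeld.run();
        if (!podAvailable) {
            try {
                Thread.sleep(simulatedTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Measures how long a connection to a healthy peer is delayed while
    // another thread is stuck connecting to an unavailable pod.
    long measureHealthyPeerDelayMs() throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread stuck = new Thread(() -> connectOne(false, lockHeld::countDown));
        stuck.start();
        lockHeld.await(); // the stuck thread now owns the monitor

        long start = System.nanoTime();
        connectOne(true, () -> { }); // blocks until the stuck thread leaves
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        stuck.join();
        return elapsedMs;
    }
}
```

   In this sketch the healthy peer is delayed by roughly the full simulated
   timeout, even though its own connection would complete instantly.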
   
   One workaround is to decrease the socket connection timeout with the
   '-Dzookeeper.cnxTimeout' system property, but the proper fix would be to
   make the connection initiation fully asynchronous, as using a very low
   timeout can have its own side effects. Fortunately, most of the initial
   message sending is already async: SASL authentication can take more time,
   so the second part of the initiation protocol (authentication + initial
   message sending) is already executed in a separate thread when Quorum SASL
   authentication is enabled.
   
   In the following patch I made the whole connection initiation async, by
   always using the async executor (not only when Quorum SASL is enabled) and
   also moving the socket.connect call into the async thread.
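
   As a rough illustration of the idea (not the code of this patch), the
   sketch below moves the blocking connect call into an executor thread, so
   the caller returns immediately and one slow peer cannot delay the others.
   The class name, 'initiateConnectionAsync', the injected socket supplier
   and the cached thread pool are all assumptions made for this example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical sketch: connection initiation that never blocks the caller.
class AsyncConnectSketch {
    private final ExecutorService connectionExecutor = Executors.newCachedThreadPool();
    private final Supplier<Socket> socketFactory; // replaceable, e.g. in tests
    private final int connectTimeoutMs;

    AsyncConnectSketch(Supplier<Socket> socketFactory, int connectTimeoutMs) {
        this.socketFactory = socketFactory;
        this.connectTimeoutMs = connectTimeoutMs;
    }

    // Returns at once; the future completes with true if the connect
    // succeeded and false if it failed or timed out.
    Future<Boolean> initiateConnectionAsync(InetSocketAddress peer) {
        return connectionExecutor.submit(() -> {
            Socket sock = socketFactory.get();
            try {
                // The potentially slow call now runs on an executor thread.
                sock.connect(peer, connectTimeoutMs);
                // Authentication and the initial message would follow here.
                return true;
            } catch (IOException e) {
                try {
                    sock.close();
                } catch (IOException ignored) {
                    // nothing more to clean up
                }
                return false;
            }
        });
    }

    void shutdown() {
        connectionExecutor.shutdownNow();
    }
}
```

   With this shape, several connection attempts to unavailable pods time out
   in parallel instead of one after another, and the thread that triggers
   them is never held up by 'socket.connect'.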
   
   I also created a unit test to verify my fix. I added a static socket factory
   that can be changed by the tests using a package-private setter method.
   Before I applied my changes, the test failed (producing the same error logs
   as we see in the original Jira ticket) and timed out, as no leader election
   succeeded within 15 seconds. After the changes the test runs very quickly,
   in 1-2 seconds.
   
   Note: due to the multiAddress changes, we will need different PRs for the
   3.5 branch and for the 3.6+ branches. I will submit the other PR once this
   one has been reviewed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
