Brian Lininger created ZOOKEEPER-2849:
-----------------------------------------

             Summary: Quorum port binding needs exponential back-off retry
                 Key: ZOOKEEPER-2849
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2849
             Project: ZooKeeper
          Issue Type: Improvement
          Components: quorum
    Affects Versions: 3.5.3, 3.4.6
            Reporter: Brian Lininger
            Priority: Minor


We recently upgraded the AWS instance type we use for running our ZooKeeper 
nodes, and since then we intermittently hit an issue where ZooKeeper cannot 
bind to the server election port because the IP is incorrect.  This is due to 
name resolution in Route53 not yet being in sync when ZooKeeper starts on the 
more powerful EC2 instances.  Currently, QuorumCnxManager.Listener attempts to 
bind only 3 times with a 1s sleep between retries, which is not long enough.

I propose changing this to an exponential back-off strategy, where each failed 
attempt leads to a longer sleep before the next retry.  This would allow 
ZooKeeper to recover gracefully when the host is misconfigured and subsequently 
corrected, without requiring the process to be restarted, while also minimizing 
the impact on the running instance.
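
To make the proposal concrete, here is a minimal sketch of what an exponential 
back-off bind loop could look like.  It is not the existing 
QuorumCnxManager.Listener code: the class name BackoffBindSketch, the methods 
delayMillis and bindWithBackoff, and the 1s base / 60s cap values are all 
hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical sketch only; not the actual QuorumCnxManager.Listener code. */
final class BackoffBindSketch {

    /** Sleep before the retry that follows failed attempt `attempt` (0-based). */
    static long delayMillis(int attempt, long baseMs, long maxMs) {
        // Clamp the exponent so the shift stays within a long for base
        // delays in the seconds range, then apply the cap.
        int shift = Math.min(attempt, 30);
        return Math.min(maxMs, baseMs << shift);
    }

    /** Tries to bind up to maxAttempts times, doubling the sleep after each failure. */
    static ServerSocket bindWithBackoff(String host, int port, int maxAttempts,
                                        long baseMs, long maxMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            ServerSocket ss = new ServerSocket();
            try {
                ss.setReuseAddress(true);
                // A new InetSocketAddress triggers a fresh lookup of the host
                // name on each attempt (subject to the JVM's DNS cache settings).
                ss.bind(new InetSocketAddress(host, port));
                return ss;
            } catch (IOException e) {
                last = e;
                ss.close();
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(delayMillis(attempt, baseMs, maxMs));
                }
            }
        }
        throw last != null ? last : new IOException("no bind attempts made");
    }

    public static void main(String[] args) {
        // Print the sleep schedule for an assumed 1s base and 60s cap.
        for (int attempt = 0; attempt < 7; attempt++) {
            System.out.println("after failure " + (attempt + 1) + ": sleep "
                    + delayMillis(attempt, 1000L, 60000L) + " ms");
        }
    }
}
```

With the assumed 1s base and 60s cap, the sleeps after successive failures are 
1s, 2s, 4s, 8s, 16s, 32s, and then 60s, so seven retries already cover about 
two minutes, compared with the few seconds covered by 3 attempts at a fixed 1s 
interval.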



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
