[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335904#comment-14335904
 ] 

Mark Duske commented on ZOOKEEPER-1846:
---------------------------------------

I had the same issue, but thanks to the log messages I was able to find and fix 
it at the source. An IP address that is not yet available, or that changes 
suddenly, is actually VERY common in most major cloud solutions nowadays. That 
makes this bug SEVERE, because high availability does not work at all when such 
DNS changes occur; even overriding the DNS caching setting in Java is 
useless.
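
A minimal sketch of why the JVM cache setting does not help, assuming the cause is the one described in the issue: an InetSocketAddress is resolved (or not) once, when it is constructed, and the object never looks the name up again. The class and method names below (StaleAddressDemo, reResolve) are made up for illustration and are not part of ZooKeeper.

```java
import java.net.InetSocketAddress;

// Hypothetical demo class, not ZooKeeper code.
final class StaleAddressDemo {

    // Builds a new address from the host name and port of an old one.
    // The constructor performs a new lookup attempt; the old object is untouched.
    static InetSocketAddress reResolve(InetSocketAddress old) {
        return new InetSocketAddress(old.getHostString(), old.getPort());
    }

    public static void main(String[] args) {
        // Stands in for an address whose lookup failed at startup.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("localhost", 2888);

        // The existing object stays unresolved no matter how DNS or the
        // JVM cache TTL changes afterwards.
        System.out.println("stale unresolved: " + stale.isUnresolved());

        // Only a newly constructed object gets another chance to resolve.
        InetSocketAddress fresh = reResolve(stale);
        System.out.println("fresh unresolved: " + fresh.isUnresolved());
    }
}
```

The cache TTL only influences lookups that actually happen, so a long-lived address object is unaffected by it.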

This issue affects, at least, the following methods:
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
connectOne(QuorumCnxManager.java:369)

The solution was to add a new method:
org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer.refreshDNS()

I call it whenever an IOException occurs in the methods listed above. Now I can 
start nodes at any time (even saving my DNS container for last) and restart 
them (the IPs are always different), and the negotiation works like a charm.

Calling this issue "Minor" is hard to understand: nowadays all sorts of 
systems are virtualized and, most of the time, IP addresses are dynamically 
assigned at system startup. When multiple nodes start at the same time, it is 
likely that a few moments will pass before the IPs are assigned... that is 
exactly what I go through with Docker and ZooKeeper.

This is the new method added 
(org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer.refreshDNS()):
        /**
         * Forces the hostname to be resolved to an IP address again, because
         * the address can be dynamic and is sometimes not yet available when
         * the service starts.
         */
        public void refreshDNS() {
                LOG.debug("Refreshing DNS for Quorum Peer " + electionAddr);

                // Constructing a new InetSocketAddress triggers a new lookup
                if (electionAddr != null) {
                        electionAddr = new InetSocketAddress(
                                electionAddr.getHostName(), electionAddr.getPort());
                }

                if (addr != null) {
                        addr = new InetSocketAddress(
                                addr.getHostName(), addr.getPort());
                }
        }

This is how the catch clauses of the affected methods look now:
} catch (IOException e) {
                // Works around a negative or outdated cache hit that prevents
                // the servers from communicating, by forcing the hostname to
                // be resolved to an IP address again
                self.getView().get(sid).refreshDNS();
                self.quorumPeers.get(sid).refreshDNS();
                
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr,
                        e);
            }
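
For readers who want to try the idea outside ZooKeeper, the refresh-on-failure approach above can be sketched as a small standalone holder. This is only an illustration under stated assumptions: RefreshableAddress and its methods are hypothetical names, and the holder keeps the original host string itself rather than calling getHostName() on the old address.

```java
import java.net.InetSocketAddress;

// Hypothetical standalone holder, not ZooKeeper's QuorumServer.
final class RefreshableAddress {
    private final String host;   // configured host name, kept as given
    private final int port;
    private volatile InetSocketAddress current;

    RefreshableAddress(String host, int port) {
        this.host = host;
        this.port = port;
        // The constructor attempts a lookup; on failure the address is
        // simply marked unresolved instead of throwing.
        this.current = new InetSocketAddress(host, port);
    }

    // Returns the most recently built address.
    InetSocketAddress get() {
        return current;
    }

    // Builds a new address from the stored host and port, which attempts
    // a new lookup. Intended to be called after a failed connection attempt.
    InetSocketAddress refresh() {
        current = new InetSocketAddress(host, port);
        return current;
    }
}
```

A caller would use get() for each connection attempt and call refresh() from its IOException handler, so that the next attempt works with a newly resolved address.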

> Cached InetSocketAddresses prevent proper dynamic DNS resolution
> ----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1846
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1846
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.5
>            Reporter: Benjamin Jaton
>            Priority: Minor
>
> The class QuorumPeer maintains a Map<Long, QuorumServer> quorumPeers.
> Each QuorumServer is created with an instance of InetSocketAddress 
> electionAddr, and holds it forever.
> I believe this is why the ZooKeeper servers can't resolve each other 
> dynamically: If a ZooKeeper in the ensemble cannot be resolved at startup, it 
> will never be resolved (until restart of the JVM), constantly failing with an 
> UnknownHostException, even when the node is back up and reachable.
> I would suggest to recreate an InetSocketAddress every time we retry the 
> connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
