Flavio Junqueira commented on ZOOKEEPER-2982:

[~andorm], I have also tried your recipe for reproducing the problem by changing 
{{/etc/hosts}} and got the same issue. The problem is that the leader fails to 
bind to the port, which makes me wonder whether we need to do anything about 
the leader with respect to this issue:

java.net.SocketException: Unresolved address
        at java.net.ServerSocket.bind(ServerSocket.java:368)
        at java.net.ServerSocket.bind(ServerSocket.java:329)
        at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:240)

Your suggestion of the alternative change is sensible, but I'd say that for 
consistency it is better to simply do the same as in 3.4, which is to make the 
change in {{findLeader}}.
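
For illustration only, here is a minimal sketch of what re-resolving a peer 
address before connecting could look like. The class and method names 
({{AddressRefresh}}, {{refreshIfUnresolved}}) are hypothetical and this is not 
the actual 3.4 code; it only relies on the standard behaviour that an 
{{InetSocketAddress}} keeps its resolution result from construction time, so a 
fresh lookup needs a new instance.

```java
import java.net.InetSocketAddress;

// Hypothetical helper, not part of ZooKeeper: shows the general re-resolution idea.
final class AddressRefresh {
    private AddressRefresh() {}

    /**
     * Returns addr unchanged if it is already resolved. Otherwise builds a new
     * InetSocketAddress from the same host string and port, which makes the
     * constructor attempt a fresh lookup. The result may still be unresolved
     * if the lookup fails again, so callers would retry on a later attempt.
     */
    static InetSocketAddress refreshIfUnresolved(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr;
        }
        // getHostString() returns the configured host name without a reverse lookup.
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }
}
```

A caller such as {{findLeader}} could, under this assumption, pass the stored 
peer address through such a helper each time it is about to connect, rather 
than reusing the instance created at startup.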

One thing that I believe we haven't been able to do is to have a test case to 
report it. It would be good to have it, but I'm not sure what would be a good 

> Re-try DNS hostname -> IP resolution
> ------------------------------------
>                 Key: ZOOKEEPER-2982
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.5.1, 3.5.3
>            Reporter: Eron Wright 
>            Priority: Blocker
>             Fix For: 3.5.4, 3.6.0
>         Attachments: 3.5.3-beta.zip, fixed.log
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address. For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95] - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
> The server should eventually succeed but doesn't.
