[ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371452#comment-16371452 ]
Flavio Junqueira commented on ZOOKEEPER-2982:
---------------------------------------------
From the logs, we can see the same exception being raised when the server is
trying to connect to its peers to elect a leader:
{noformat}
2018-02-20 20:41:25,669 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@173] - Failed to resolve address: pravega-zookeeper-2.pravega-zookeeper-headless.default.svc.cluster.local
java.net.UnknownHostException: pravega-zookeeper-2.pravega-zookeeper-headless.default.svc.cluster.local
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at java.net.InetAddress.getByName(InetAddress.java:1076)
    at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:171)
    at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:727)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:682)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:716)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1190)
{noformat}
Once the address resolves and the server can connect, the exception goes away
and the notification messages flow regularly. The question is why the update
that {{QuorumCnxManager.connectOne}} makes to the quorum view during leader
election has no effect on the view that {{Learner.findLeader}} uses to get the
{{QuorumServer}} instance for connecting to the leader. Two possibilities I can
think of:
1- The server hasn't connected to the elected server during leader election, in
which case the address wasn't updated.
2- The quorum view that the learner is using to get the quorum server instance
is not the one that was updated in {{QuorumCnxManager}}.
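
To make the second possibility concrete, here is a minimal, self-contained
sketch. It is not ZooKeeper code: {{Server}}, {{electionView}} and
{{learnerView}} are hypothetical stand-ins for {{QuorumServer}} and the two
quorum views, and the IP literal 127.0.0.1 is used only so the example needs no
DNS. It assumes the two code paths hold separate server instances, which is
exactly the hypothesis that still needs checking.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

class StaleViewSketch {
    /** Hypothetical stand-in for QuorumPeer.QuorumServer; not the real class. */
    static final class Server {
        final String host;
        final int port;
        InetSocketAddress addr;

        Server(String host, int port) {
            this.host = host;
            this.port = port;
            // Simulate a lookup that failed at startup: the address stays unresolved.
            this.addr = InetSocketAddress.createUnresolved(host, port);
        }

        /** Re-run resolution; keep the old address if the lookup still fails. */
        void recreateSocketAddress() {
            InetSocketAddress fresh = new InetSocketAddress(host, port);
            if (!fresh.isUnresolved()) {
                addr = fresh;
            }
        }
    }

    public static void main(String[] args) {
        // View consulted by the election code path (assumption for illustration).
        Map<Long, Server> electionView = new HashMap<>();
        electionView.put(2L, new Server("127.0.0.1", 2888));

        // View consulted by the learner, holding a *separate* Server instance.
        Map<Long, Server> learnerView = new HashMap<>();
        learnerView.put(2L, new Server("127.0.0.1", 2888));

        // Only the election path re-resolves the address.
        electionView.get(2L).recreateSocketAddress();

        System.out.println("election view unresolved: "
                + electionView.get(2L).addr.isUnresolved()); // false
        System.out.println("learner view unresolved: "
                + learnerView.get(2L).addr.isUnresolved());  // true
    }
}
```

If the real views share the same {{QuorumServer}} instances, this effect cannot
occur and the first possibility becomes the more likely explanation.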
> Re-try DNS hostname -> IP resolution
> ------------------------------------
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0, 3.5.1, 3.5.3
> Reporter: Eron Wright
> Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4. Some portions of the fix
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started
> before all peer addresses are resolvable, that server may cache a negative
> lookup result and forever fail to resolve the address. For example,
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless)
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95] - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.net.Socket.connect(Socket.java:589)
>     at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>     at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not
> resolvable when the server started, but became resolvable shortly thereafter.
> The server should eventually succeed but doesn't.