[
https://issues.apache.org/jira/browse/ZOOKEEPER-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andor Molnar reassigned ZOOKEEPER-3320:
---------------------------------------
Assignee: Igor Skokov
> Leader election port stop listen when hostname unresolvable for some time
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3320
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3320
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.10, 3.5.4
> Reporter: Igor Skokov
> Assignee: Igor Skokov
> Priority: Major
> Labels: pull-request-available
> Time Spent: 8h 20m
> Remaining Estimate: 0h
>
> When trying to run a ZooKeeper 3.5.4 cluster on Kubernetes, I found that under
> some circumstances a ZooKeeper node stops listening on the leader election port.
> This makes the ZK cluster unavailable.
> ZooKeeper is deployed as a Kubernetes StatefulSet and has the following
> dynamic configuration:
> {code:java}
> zookeeper-0.zookeeper:2182:2183:participant;2181
> zookeeper-1.zookeeper:2182:2183:participant;2181
> zookeeper-2.zookeeper:2182:2183:participant;2181
> {code}
> The bind address contains the DNS name that Kubernetes generates for each
> StatefulSet pod.
> These DNS names become resolvable after the container starts, but with some
> delay. That delay causes the leader election port listener in the
> QuorumCnxManager.Listener class to stop.
> The error happens in the QuorumCnxManager.Listener "run" method, which tries to
> bind the leader election port to a hostname that is not resolvable at that
> moment. The retry count is hard-coded to 3 (with a backoff of 1 sec).
> The ZooKeeper server log contains the following errors:
> {code:java}
> 2019-03-17 07:56:04,844 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1230] - Unexpected exception
> java.net.SocketException: Unresolved address
> at java.base/java.net.ServerSocket.bind(ServerSocket.java:374)
> at java.base/java.net.ServerSocket.bind(ServerSocket.java:335)
> at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:241)
> at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
> 2019-03-17 07:56:04,844 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1261] - PeerState set to LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1136] - LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@893] - New election. My id = 1, proposed zxid=0x0
> 2019-03-17 07:56:04,846 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@687] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xf (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2019-03-17 07:56:04,979 [myid:1] - INFO [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@892] - Leaving listener
> 2019-03-17 07:56:04,979 [myid:1] - ERROR [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@894] - As I'm leaving the listener thread, I won't be able to participate in leader election any longer: zookeeper-0.zookeeper:2183
> {code}
> This error happens on most nodes at cluster start, and ZooKeeper is unable to
> form a quorum. This leaves the cluster in an unusable state.
> As far as I can see, the error is present on branches 3.4 and 3.5.
> I think this error can be fixed by making the number of retries configurable
> (instead of the hard-coded value of 3).
> Another way to fix this is to remove the retry limit altogether. Currently, the
> ZK server only stops the leader election listener and continues to serve on
> other ports.
> Maybe, if leader election halts, we should abort the process.
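
For illustration only, the two proposals in the description (a configurable retry count, or no limit at all) could be sketched roughly as follows. This is not the ZooKeeper implementation or the attached pull request: the class ElectionBindRetry, the method bindWithRetry, and the property name example.electionPortBindRetry are all hypothetical, and the sketch assumes that a value of 0 means "retry forever". It relies only on standard java.net calls. Each attempt builds a fresh InetSocketAddress, so the hostname is looked up again instead of reusing a cached unresolved address.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch, not ZooKeeper code: bind a listener port with a
// configurable number of retries instead of a hard-coded limit of 3.
final class ElectionBindRetry {

    // Assumed property name, invented for this example.
    static final String MAX_RETRY_PROPERTY = "example.electionPortBindRetry";

    /**
     * Tries to bind host:port, sleeping backoffMillis between failed attempts.
     * maxRetries == 0 is treated as "no limit" (an assumption of this sketch).
     */
    static ServerSocket bindWithRetry(String host, int port, int maxRetries, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; maxRetries == 0 || attempt < maxRetries; attempt++) {
            // A new InetSocketAddress triggers a new DNS lookup on every attempt.
            InetSocketAddress addr = new InetSocketAddress(host, port);
            ServerSocket socket = new ServerSocket();
            try {
                socket.setReuseAddress(true);
                // Throws SocketException("Unresolved address") while DNS is not ready.
                socket.bind(addr);
                return socket;
            } catch (IOException e) {
                last = e;
                socket.close();
                Thread.sleep(backoffMillis);
            }
        }
        // Only reachable when maxRetries >= 1 and every attempt failed.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Default of 3 mirrors the hard-coded value mentioned in the report.
        int maxRetries = Integer.getInteger(MAX_RETRY_PROPERTY, 3);
        // Port 0 lets the OS pick a free port, so the demo does not collide
        // with a running server.
        try (ServerSocket socket = bindWithRetry("localhost", 0, maxRetries, 1000L)) {
            System.out.println("Bound to " + socket.getLocalSocketAddress());
        }
    }
}
```

When the retries are exhausted, the caller receives the last exception and can decide whether to abort the process rather than continue serving on the other ports, which is the alternative raised at the end of the description.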
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)