[ https://issues.apache.org/jira/browse/ZOOKEEPER-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748122#comment-17748122 ]
Luke Chen commented on ZOOKEEPER-4728:
--------------------------------------

PR [https://github.com/apache/zookeeper/pull/2040] has been raised to fix part of this issue: it forces the hostname to be re-resolved into an IP address when binding (i.e. ZOOKEEPER-4728).

> Zookeepr cannot bind to itself forever if DNS is not ready when startup
> -----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4728
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4728
>             Project: ZooKeeper
>          Issue Type: Sub-task
>    Affects Versions: 3.6.4
>            Reporter: Luke Chen
>            Priority: Major
>
> Note: This issue also occurs on the latest `master` branch.
>
> When the leader tries to bind to its host/IP to accept connections from
> followers and DNS is not ready at that moment, the address stays in the
> {{<unresolved>}} state forever. The error log looks like this:
>
> {code:java}
> 2023-07-26 00:25:25,251 ERROR Couldn't bind to localhost1/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> java.net.SocketException: Unresolved address
>     at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
>     at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
>     at org.apache.zookeeper.server.quorum.Leader.createServerSocket(Leader.java:315)
>     at org.apache.zookeeper.server.quorum.Leader.lambda$new$0(Leader.java:294)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>     at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
>     at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:297)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1272)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1479)
> 2023-07-26 00:25:25,252 WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> java.io.IOException: Leader failed to initialize any of the following sockets: [metrics-cluster-1-zookeeper-0.metrics-cluster-1-zookeeper-nodes.metrics-test-1.svc/<unresolved>:2888]
>     at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:300)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1272)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1479)
> {code}
>
> These errors repeat indefinitely and the bind never succeeds, so the quorum
> is never formed.
>
> Steps to reproduce:
> 1. Set up one ZooKeeper node and set the ZooKeeper connection config to:
> {code:java}
> server.1=localhost1:2888:3888{code}
> Note that the hostname is "localhost1".
> 2. Start the ZooKeeper node. It shows the `Exception while listening` error
> as well as the `Couldn't bind to localhost1/<unresolved>:2888` error shown
> above. This simulates DNS not being ready when ZooKeeper starts up, which is
> quite common in k8s environments.
> 3. Edit /etc/hosts and map `localhost1` to `127.0.0.1`.
> 4. Check the log: the `Exception while listening` error is gone, but
> `Couldn't bind to localhost1/<unresolved>:2888` keeps appearing and the
> quorum is never formed.
>
> Note: The `Exception while listening` error is self-healing because that
> code path re-resolves the hostname each time it tries to bind. We should
> apply the same solution to the leader binding.

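The re-resolution idea described in the note above can be sketched as follows. This is only an illustration of the general technique, not the code from the PR or from ZooKeeper itself: the class `ReResolveSketch` and the helpers `reResolve` and `bindFresh` are hypothetical names, and only standard `java.net` calls are used. It relies on the fact that an `InetSocketAddress` attempts resolution once, in its constructor, so an instance that came out unresolved stays that way until a new instance is created.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch; not the actual ZooKeeper Leader implementation.
public class ReResolveSketch {

    // Build a fresh InetSocketAddress from the same host string and port.
    // The constructor attempts a new lookup, so an address that was
    // unresolved earlier gets another chance once DNS or /etc/hosts is ready.
    static InetSocketAddress reResolve(InetSocketAddress addr) {
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    // Re-resolve immediately before binding instead of reusing a cached address.
    static ServerSocket bindFresh(InetSocketAddress configured) throws IOException {
        InetSocketAddress fresh = reResolve(configured);
        if (fresh.isUnresolved()) {
            throw new IOException("Still unresolved: " + fresh.getHostString());
        }
        ServerSocket socket = new ServerSocket();
        try {
            socket.bind(fresh);
        } catch (IOException e) {
            socket.close(); // do not leak the socket if the bind fails
            throw e;
        }
        return socket;
    }

    public static void main(String[] args) {
        // Simulate an address captured while resolution was unavailable.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 2888);
        System.out.println("before: unresolved=" + stale.isUnresolved());
        System.out.println("after:  unresolved=" + reResolve(stale).isUnresolved());
    }
}
```

A caller that retries the bind would call `bindFresh` on each attempt rather than holding on to the first `InetSocketAddress`. Note that the JVM may cache failed lookups for a short time, so the first retry after DNS becomes ready can still fail.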
-- This message was sent by Atlassian Jira (v8.20.10#820010)