Luke Chen created ZOOKEEPER-4728:
------------------------------------

             Summary: ZooKeeper cannot bind to itself forever if DNS is not 
ready at startup
                 Key: ZOOKEEPER-4728
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4728
             Project: ZooKeeper
          Issue Type: Sub-task
    Affects Versions: 3.6.4
            Reporter: Luke Chen


Note: This issue also occurs on the latest `master` branch.

 

When the leader tries to bind to its host/IP to accept connections from 
followers and DNS is not ready at that moment, the address stays in the 
{{<unresolved>}} state forever. The error log looks like this:

 
{code:java}
2023-07-26 00:25:25,251 ERROR Couldn't bind to localhost1/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
java.net.SocketException: Unresolved address
    at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
    at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
    at org.apache.zookeeper.server.quorum.Leader.createServerSocket(Leader.java:315)
    at org.apache.zookeeper.server.quorum.Leader.lambda$new$0(Leader.java:294)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
    at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:297)
    at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1272)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1479)
2023-07-26 00:25:25,252 WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
java.io.IOException: Leader failed to initialize any of the following sockets: [metrics-cluster-1-zookeeper-0.metrics-cluster-1-zookeeper-nodes.metrics-test-1.svc/<unresolved>:2888]
    at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:300)
    at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1272)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1479)
{code}
 

 

These errors appear repeatedly; the leader never binds to the address, so the 
quorum is never formed.
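
For illustration, the JDK behaviour behind the {{Unresolved address}} message can be reproduced in isolation. This is only a minimal sketch: the class name {{UnresolvedBindDemo}} and the helper {{tryBind}} are hypothetical and not part of ZooKeeper. It relies on the fact that a {{java.net.InetSocketAddress}} keeps whatever resolution state it had when it was created, so retrying {{bind}} with the same object cannot succeed even after DNS recovers.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketException;

// Hypothetical demo class, not ZooKeeper code.
public class UnresolvedBindDemo {

    // Tries to bind a fresh ServerSocket to addr and reports the outcome.
    static String tryBind(InetSocketAddress addr) throws IOException {
        try (ServerSocket ss = new ServerSocket()) {
            ss.bind(addr);
            return "bound";
        } catch (SocketException e) {
            // ServerSocket.bind rejects an unresolved address up front.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // createUnresolved never performs a lookup, which mimics an address
        // object that was built while DNS was not ready.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("localhost1", 0);

        // The same object stays unresolved no matter how often bind is retried.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": " + tryBind(stale));
        }
    }
}
```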

 

Steps to reproduce:

1. Set up one ZooKeeper node and set the server entry in the ZooKeeper config to:
{code:java}
server.1=localhost1:2888:3888{code}
Note that the hostname is "localhost1", not "localhost".

2. Start the ZooKeeper node. It logs the `Exception while listening` error as 
well as the `Couldn't bind to localhost1/<unresolved>:2888` error shown above. 
This simulates DNS not being ready when ZooKeeper starts, which is quite common 
in k8s environments.

3. Edit /etc/hosts and map `localhost1` to `127.0.0.1`.

4. Check the log: the `Exception while listening` error is gone, but 
`Couldn't bind to localhost1/<unresolved>:2888` keeps appearing and the 
quorum is never formed.

 

Note: The `Exception while listening` error is self-healing because that code 
path re-resolves the hostname each time it tries to bind. We should apply the 
same approach to the leader binding.
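
The re-resolution idea could look roughly like the sketch below. It is not the actual ZooKeeper patch: the class {{ReResolvingBinder}} and the methods {{reResolve}} and {{bindWithReResolve}} are hypothetical names introduced only for illustration. The assumption is that the leader holds on to an {{InetSocketAddress}} created at startup; building a new one from the same hostname and port triggers a fresh lookup before each bind attempt.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper, not ZooKeeper code.
public class ReResolvingBinder {

    // Returns addr unchanged if it is already resolved; otherwise builds a new
    // InetSocketAddress from the same host and port, which performs a new
    // lookup. If the lookup still fails, the result is again unresolved.
    static InetSocketAddress reResolve(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr;
        }
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    // Binds a server socket, re-resolving the configured address first.
    // A caller would invoke this again on its next retry if it throws.
    static ServerSocket bindWithReResolve(InetSocketAddress configured) throws IOException {
        ServerSocket ss = new ServerSocket();
        try {
            ss.setReuseAddress(true);
            ss.bind(reResolve(configured));
            return ss;
        } catch (IOException e) {
            ss.close(); // do not leak the socket when the bind fails
            throw e;
        }
    }
}
```

With this shape, a retry that happens after the /etc/hosts entry (or the DNS record) appears would get a resolved address and the bind could succeed, instead of reusing the stale unresolved object.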



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
