Swathi Mocharla created ZOOKEEPER-4842:
------------------------------------------

             Summary: ZooKeeper quorum intermittently fails to form when the cluster domain name has a trailing dot
                 Key: ZOOKEEPER-4842
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4842
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.8.4
            Reporter: Swathi Mocharla


On Kubernetes, we have configured the cluster domain with a trailing dot. With this configuration, the ZooKeeper quorum frequently fails to form at all.

 
{code:java}
bash-4.4$ env -u KAFKA_OPTS zookeeper-shell localhost:2181 config
Connecting to localhost:2181
[2024-06-25 10:36:39,178] WARN Client session timed out, have not heard from server in 30031ms for session id 0x0 (org.apache.zookeeper.ClientCnxn)
[2024-06-25 10:36:39,182] WARN Session 0x0 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 30031ms for session id 0x0
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1257)
KeeperErrorCode = ConnectionLoss for /zookeeper/config
 
{code}
 

The ZooKeeper server logs contain many IOException, UnknownHostException, and InterruptedException errors.

 
{code:java}
java.io.IOException: ZooKeeperServer not running
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:565)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:350)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
{"type":"log", "host":"zk-swkf-2.default", "level":"WARN", "systemid":"zookeeper-2b13339237454984887b4908dc3a6df0", "system":"zookeeper", "time":"2024-06-25T10:23:16.325Z", "timezone":"UTC", "log":{"message":"NIOWorkerThread-1 - org.apache.zookeeper.server.NIOServerCnxn - Close of session 0x0"}}
 
java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
{code}

This is the content of /etc/resolv.conf:
{code:java}
bash-4.4$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local bcmt
nameserver 10.254.0.10
options ndots:5
{code}
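For context on this resolv.conf: under the resolv.conf(5) rules, a name with fewer than ndots dots is first expanded through the search list and only then tried as-is, while a name ending in a dot is treated as absolute and queried directly. With ndots:5, zk-swkf.default.svc.cluster.local (four dots) therefore goes through the search list first. The Java sketch below models only those search/ndots rules for a glibc-style stub resolver. It is not ZooKeeper code, it does not perform any DNS lookups, and the class and method names (ResolverSearchOrder, candidates) are hypothetical.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models the resolv.conf(5) "search"/"ndots" query order
// of a glibc-style stub resolver. It only builds the list of names; no DNS is queried.
class ResolverSearchOrder {

    // Returns the absolute names (with trailing dot) that would be queried, in order.
    static List<String> candidates(String name, List<String> search, int ndots) {
        List<String> out = new ArrayList<>();
        if (name.endsWith(".")) {
            // Trailing dot: the name is absolute and the search list is skipped.
            out.add(name);
            return out;
        }
        long dots = name.chars().filter(c -> c == '.').count();
        String absolute = name + ".";
        if (dots >= ndots) {
            // Enough dots: the name is tried as-is before the search list.
            out.add(absolute);
        }
        for (String domain : search) {
            out.add(name + "." + domain + ".");
        }
        if (dots < ndots) {
            // Too few dots: the name as-is is tried only after every search domain.
            out.add(absolute);
        }
        return out;
    }

    public static void main(String[] args) {
        // Search list and ndots copied from the /etc/resolv.conf shown above.
        List<String> search = List.of(
                "default.svc.cluster.local", "svc.cluster.local", "cluster.local", "bcmt");
        for (String host : List.of(
                "zk-swkf.default.svc.cluster.local", "zk-swkf.default.svc.cluster.local.")) {
            System.out.println(host + " -> " + candidates(host, search, 5));
        }
    }
}
{code}

Under these assumptions, main lists five candidate queries for zk-swkf.default.svc.cluster.local (the four search-list expansions, then the absolute name) and a single query for zk-swkf.default.svc.cluster.local. with the trailing dot.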

Looking up the service name with and without the trailing dot returns the same address:
{code:java}
[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local.
Server:         10.76.72.33
Address:        10.76.72.33#53
Name:   zk-swkf.default.svc.cluster.local
Address: 10.254.94.24
[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local
Server:         10.76.72.33
Address:        10.76.72.33#53
Name:   zk-swkf.default.svc.cluster.local
Address: 10.254.94.24
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)