Swathi Mocharla created ZOOKEEPER-4842:
------------------------------------------

             Summary: Zookeeper quorum is not formed intermittently with trailing dot in the cluster domain name
                 Key: ZOOKEEPER-4842
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4842
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.8.4
            Reporter: Swathi Mocharla


On Kubernetes, we have set up the cluster domain with a trailing dot. With this setup, the ZooKeeper quorum frequently fails to form.
{code:java}
bash-4.4$ env -u KAFKA_OPTS zookeeper-shell localhost:2181 config
Connecting to localhost:2181
[2024-06-25 10:36:39,178] WARN Client session timed out, have not heard from server in 30031ms for session id 0x0 (org.apache.zookeeper.ClientCnxn)
[2024-06-25 10:36:39,182] WARN Session 0x0 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 30031ms for session id 0x0
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1257)
KeeperErrorCode = ConnectionLoss for /zookeeper/config
{code}
In the ZooKeeper logs, we see many IOException, UnknownHostException, and InterruptedException entries.
{code:java}
java.io.IOException: ZooKeeperServer not running
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:565)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:350)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
{"type":"log", "host":"zk-swkf-2.default", "level":"WARN", "systemid":"zookeeper-2b13339237454984887b4908dc3a6df0", "system":"zookeeper", "time":"2024-06-25T10:23:16.325Z", "timezone":"UTC", "log":{"message":"NIOWorkerThread-1 - org.apache.zookeeper.server.NIOServerCnxn - Close of session 0x0"}}
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
{code}
This is the content of /etc/resolv.conf:
{code:java}
bash-4.4$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local bcmt
nameserver 10.254.0.10
options ndots:5
{code}
nslookup resolves the service name both with and without the trailing dot:
{code:java}
[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local.
Server:   10.76.72.33
Address:  10.76.72.33#53

Name:     zk-swkf.default.svc.cluster.local
Address:  10.254.94.24

[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local
Server:   10.76.72.33
Address:  10.76.72.33#53

Name:     zk-swkf.default.svc.cluster.local
Address:  10.254.94.24
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
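
The nslookup output above only shows what the DNS server returns when queried from the host. To see whether the JVM resolver inside a ZooKeeper pod treats the two forms of the name differently, a small standalone probe may help. This is a minimal sketch and not part of ZooKeeper: the class name TrailingDotCheck and its two helpers are hypothetical, and the only library call it relies on is java.net.InetAddress.getAllByName.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class TrailingDotCheck {

    /** Returns the host name without a single trailing root dot, if one is present. */
    public static String stripTrailingDot(String host) {
        return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    }

    /** Resolves the name through the JVM resolver and reports the outcome as text. */
    public static String describe(String host) {
        try {
            InetAddress[] addrs = InetAddress.getAllByName(host);
            return host + " -> " + Arrays.toString(addrs);
        } catch (UnknownHostException e) {
            return host + " -> UnknownHostException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Default is the service name from the nslookup example in this report;
        // pass a different name (for example a per-pod name) as the first argument.
        String fqdn = args.length > 0 ? args[0] : "zk-swkf.default.svc.cluster.local.";
        System.out.println(describe(fqdn));                    // with trailing dot
        System.out.println(describe(stripTrailingDot(fqdn)));  // without trailing dot
    }
}
```

Running it from inside a ZooKeeper pod (so that the pod's /etc/resolv.conf is in effect) with each server name from the quorum configuration would show whether either form intermittently fails with UnknownHostException.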