Luke Chen created ZOOKEEPER-4724:
------------------------------------

             Summary: follower can't connect to the right leader and quorum failed to form
                 Key: ZOOKEEPER-4724
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.4
            Reporter: Luke Chen

When entering the "following - discovery" state, a follower connects to the leader node so that a quorum can form. Recently, a user hit an issue where the follower could not connect to the right leader and the quorum failed to form.

From the log, I can see the follower is trying to connect to itself (0.0.0.0:2888) instead of the leader. After 5 retries, a new election starts and the same thing happens again: the node becomes a follower and tries to connect to itself, over and over.

The log looks like this:
{code:java}
2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS (org.apache.zookeeper.server.quorum.Learner) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init limit=10000, connecting to /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
	at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
	at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
	at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366)
{code}
One thing I found is that this issue happened after "Restarting leader election" on the follower node.
I'm not sure whether it is related.

*The configuration and setup:*
# 2 zookeeper nodes
# On each zookeeper node, we set the node's own address to 0.0.0.0 to work around a slow-DNS issue in k8s.

That is, for node 1, we have:
{code:java}
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
{code}
For node 2, we have:
{code:java}
server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181
{code}
Logs:
[zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
[zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
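
To make the setup in this report easier to reason about, here is a minimal sketch that reads the two {{server.N}} entries from node 1 and flags the one whose quorum host is the wildcard address. This is not ZooKeeper's own parsing code; the class name {{QuorumAddressCheck}} and the methods {{quorumHost}} and {{isWildcard}} are hypothetical and exist only for illustration. It shows which entries carry an address that a remote peer cannot dial, and makes no claim about how the follower actually selects the leader address.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class QuorumAddressCheck {

    /**
     * Extracts the quorum host from a line such as
     * "server.1=host:2888:3888:participant;127.0.0.1:12181".
     * Sketch only: IPv6 literals in brackets are not handled.
     */
    static String quorumHost(String serverLine) {
        String value = serverLine.substring(serverLine.indexOf('=') + 1).trim();
        // Drop the client address that follows ';', if present.
        int semi = value.indexOf(';');
        if (semi >= 0) {
            value = value.substring(0, semi);
        }
        // The host is everything before the first ':'.
        int colon = value.indexOf(':');
        return colon >= 0 ? value.substring(0, colon) : value;
    }

    /** Returns true only for an IPv4 literal that is the wildcard address (0.0.0.0). */
    static boolean isWildcard(String host) {
        // Only inspect dotted-quad literals, so hostnames never trigger a DNS lookup.
        if (!host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return false;
        }
        try {
            return InetAddress.getByName(host).isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two entries from node 1 in this report.
        String[] node1Config = {
            "server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181",
            "server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181"
        };
        for (String line : node1Config) {
            String host = quorumHost(line);
            System.out.println(host + " wildcard=" + isWildcard(host));
        }
    }
}
```

On node 1 the check flags only {{server.1}}, which matches the {{/0.0.0.0:2888}} target shown in the WARN line of the log.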