[ https://issues.apache.org/jira/browse/ZOOKEEPER-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864631#comment-17864631 ]
luoxin commented on ZOOKEEPER-4724:
-----------------------------------

The inconsistency in the server list might be causing the problem. When server-1 becomes the leader, it synchronizes its current server list to server-2:

server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181

Upon receiving the updated server list, server-2 identifies server-1 as the leader. Consequently, server-2 restarts the election and attempts to connect to the leader (server-1) using the new address 0.0.0.0:2888, which points back at server-2 itself rather than at server-1.

> follower can't connect to the right leader and quorum failed to form
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4724
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.4
>            Reporter: Luke Chen
>            Priority: Major
>
> When entering the "following - discovery" state, the follower connects to the
> leader node to reach a quorum. Recently, however, a user hit an issue where
> the follower could not connect to the right leader and the quorum failed to
> form. From the log, I can see the follower is trying to connect to itself
> (0.0.0.0:2888) instead of the leader. After 5 retries, a new election starts
> and the whole sequence repeats: the node becomes a follower, tries to connect
> to itself, and so on, again and again.
>
> The log is like this:
> {code:java}
> 2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS (org.apache.zookeeper.server.quorum.Learner) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init limit=10000, connecting to /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
> java.net.ConnectException: Connection refused
>     at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>     at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
>     at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
>     at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
>     at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
>     at java.base/java.net.Socket.connect(Socket.java:633)
>     at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
>     at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
>     at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
>     at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366)
> {code}
>
> One thing I found is that this issue happened after "Restarting leader
> election" on the follower node. I am not sure whether it is related.
>
> I wondered whether it is a race condition between "restarting leader
> election" (which resets the vote to itself) and the vote update. But as
> mentioned above, the issue keeps happening after the next round of leader
> election.
>
> *The configuration and setup:*
> # 2 ZooKeeper nodes
> # On each ZooKeeper node, we set the node's own IP to 0.0.0.0 to work around
> the slow DNS issue in k8s (i.e. ZOOKEEPER-4708).
> That is, for node 1, we have:
> {code:java}
> server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
> {code}
> For node 2, we have:
> {code:java}
> server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
> server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> {code}
> Logs:
> [zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
> [zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
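
The hypothesis in luoxin's comment at the top of this message can be modeled with a minimal Python sketch. This is an illustration only, not ZooKeeper code: `parse_servers` and `leader_address` are hypothetical helpers, and the step where server-2's server list is replaced by the leader's list is an assumption taken from the comment, not confirmed behavior. The two configurations are the ones given in the issue description.

```python
# Illustrative model only. None of these names exist in ZooKeeper.

def parse_servers(config_text):
    """Map server id -> 'host:quorumPort' from server.N=... lines."""
    servers = {}
    for line in config_text.strip().splitlines():
        key, value = line.strip().split("=", 1)
        server_id = int(key.split(".", 1)[1])
        # value has the form host:quorumPort:electionPort:role;clientAddr
        quorum_part = value.split(";", 1)[0]
        host, quorum_port = quorum_part.split(":")[:2]
        servers[server_id] = f"{host}:{quorum_port}"
    return servers


def leader_address(view, leader_id):
    """Address a follower would dial for the leader, given its server list."""
    return view[leader_id]


NODE1_CONFIG = """
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
"""

NODE2_CONFIG = """
server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181
"""

# Server-2 starts with its own view, where server.1 is a real hostname.
node2_view = parse_servers(NODE2_CONFIG)
before = leader_address(node2_view, 1)

# Assumed step (from the comment): the leader, server-1, pushes its own
# server list to server-2, replacing server-2's view.
node2_view = parse_servers(NODE1_CONFIG)
after = leader_address(node2_view, 1)

print("leader address before sync:", before)
print("leader address after sync: ", after)
```

In this model the address server-2 uses for the leader changes from the `dev-dev2-zookeeper-0...` hostname to `0.0.0.0:2888` once the leader's list is applied, which matches the `connecting to /0.0.0.0:2888` line in the log. Whether ZooKeeper 3.6.4 actually overwrites the follower's list this way is exactly what the comment proposes and would need to be checked against the source.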