[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864631#comment-17864631
 ] 

luoxin commented on ZOOKEEPER-4724:
-----------------------------------

An inconsistency between the two nodes' server lists might be causing the 
problem. When server-1 becomes the leader, it synchronizes its current server 
list to server-2:
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
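Upon receiving the updated server list, server-2 identifies server-1 as the 
leader. Consequently, server-2 restarts the election and attempts to connect to 
the leader (server-1) using the new address 0.0.0.0:2888. From server-2's point 
of view that address is the local wildcard address, not server-1, which matches 
the "connecting to /0.0.0.0:2888" failure in the log quoted below.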
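
To make the suspected sequence concrete, here is a minimal Python sketch. It is 
only a toy model of the hypothesis above, not ZooKeeper's actual code: the 
helper names parse_server_list and leader_address are made up for illustration, 
and the "sync" step is modeled as simply replacing server-2's local view with 
the list sent by the leader.

```python
# Toy model of the hypothesis; NOT ZooKeeper's implementation.
# parse_server_list and leader_address are hypothetical helper names.

def parse_server_list(text):
    """Parse 'server.N=host:quorumPort:electionPort:role;clientAddr' lines."""
    servers = {}
    for line in text.strip().splitlines():
        key, value = line.strip().split("=", 1)
        server_id = int(key.split(".", 1)[1])
        quorum_part = value.split(";", 1)[0]  # drop the client address part
        host, quorum_port, election_port, _role = quorum_part.split(":")
        servers[server_id] = (host, int(quorum_port), int(election_port))
    return servers


def leader_address(view, leader_id):
    """Return the host:port a follower would dial to reach the leader."""
    host, quorum_port, _election_port = view[leader_id]
    return f"{host}:{quorum_port}"


# Static configuration on node 2 (from the issue description).
node2_config = """
server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181
"""

# Server list as seen by node 1, which is what the leader would send.
leader_list = """
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
"""

view = parse_server_list(node2_config)
before_sync = leader_address(view, leader_id=1)

# Assumption: the synced list replaces server-2's local view wholesale.
view = parse_server_list(leader_list)
after_sync = leader_address(view, leader_id=1)

print("before sync:", before_sync)
print("after sync: ", after_sync)
```

Before the sync, server-2 would dial server-1's DNS name; after it, the same 
lookup yields 0.0.0.0:2888, which is the address seen in the failing 
LeaderConnector log line.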

> follower can't connect to the right leader and quorum failed to form
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4724
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.4
>            Reporter: Luke Chen
>            Priority: Major
>
> When entering the "following - discovery" state, the follower connects to the 
> leader node to form a quorum. But recently, a user hit an issue where the 
> follower can't connect to the right leader and the quorum fails to form. From 
> the log, I can see the follower is trying to connect to itself 
> (0.0.0.0:2888) instead of the leader. After 5 retries, a new election 
> starts and the whole sequence repeats: the node becomes a 
> follower and tries to connect to itself, again and again.
>  
> The log is like this:
> {code:java}
> 2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS 
> (org.apache.zookeeper.server.quorum.Learner) 
> [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery 
> (org.apache.zookeeper.server.quorum.QuorumPeer) 
> [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init 
> limit=10000, connecting to /0.0.0.0:2888 
> (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
> java.net.ConnectException: Connection refused
>     at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>     at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
>     at 
> java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
>     at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
>     at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
>     at java.base/java.net.Socket.connect(Socket.java:633)
>     at 
> java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
>     at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
>     at 
> org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
>     at 
> org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366){code}
>  
> One thing I found is that this issue happened after "Restarting leader 
> election" on the follower node. I am not sure whether it is related.
>  
> I wondered whether there is a race condition between "restarting leader 
> election" (which resets the vote to itself) and the vote update. But as 
> mentioned above, the issue keeps happening after the next round of leader 
> election.
>  
> *The configuration and setup:*
>  # 2 ZooKeeper nodes
>  # On each ZooKeeper node, we set the node's own address to 0.0.0.0 to work 
> around the slow DNS issue in k8s (i.e. ZOOKEEPER-4708). That is,
> for node 1, we have:
> {code:java}
> server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181{code}
> For node 2, we have:
> {code:java}
> server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
> server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181 {code}
> Logs:
> [zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
> [zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
