[ https://issues.apache.org/jira/browse/ZOOKEEPER-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Chen updated ZOOKEEPER-4724:
---------------------------------
    Description: 
When entering the "following - discovery" state, the follower connects to the 
leader node to form a quorum. But recently, a user hit an issue where the 
follower can't connect to the right leader, so the quorum fails to form. From 
the log, I can see the follower is trying to connect to itself (0.0.0.0:2888) 
instead of the leader. After 5 retries, a new election starts and the whole 
cycle repeats: the node becomes a follower, tries to connect to itself again, 
and so on.
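
To make the failure mode easier to follow, here is a rough sketch of what the follower does at this point: it takes the sid from its current vote and looks it up in its own static server.N map to get the leader's quorum address. This is only an illustration loosely modelled on Learner.findLeader(); the class and method names below are mine, not ZooKeeper's.
{code:java}
// Illustrative sketch only -- not the actual ZooKeeper code. It shows how a
// follower in "following - discovery" resolves the leader address from its
// current vote and its own server.N map.
import java.net.InetSocketAddress;
import java.util.Map;

class LeaderLookupSketch {

    static InetSocketAddress findLeaderAddress(long votedLeaderSid,
                                               Map<Long, InetSocketAddress> quorumAddresses) {
        InetSocketAddress addr = quorumAddresses.get(votedLeaderSid);
        if (addr == null) {
            throw new IllegalStateException("Leader sid " + votedLeaderSid + " is not in the view");
        }
        // On node 1 the entry for sid 1 is 0.0.0.0:2888 (see the config below),
        // so if the current vote still points at sid 1, this is the address the
        // follower tries to connect to -- exactly what the log below shows.
        return addr;
    }

    public static void main(String[] args) {
        Map<Long, InetSocketAddress> view = Map.of(
                1L, new InetSocketAddress("0.0.0.0", 2888),  // node 1's own entry
                2L, InetSocketAddress.createUnresolved(
                        "dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc", 2888));
        // A vote that still points at sid 1 resolves to 0.0.0.0:2888.
        System.out.println(findLeaderAddress(1L, view));
    }
}
{code}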

 

The log is like this:
{code:java}
2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS (org.apache.zookeeper.server.quorum.Learner) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init limit=10000, connecting to /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
    at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
    at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366){code}
 

One thing I found is that this issue happened after "Restarting leader 
election" on the follower node. I'm not sure if it is related.

 

I was wondering whether there is a race condition between "restarting leader 
election" (which resets the vote to the node itself) and the vote update. But 
as mentioned above, the issue keeps happening after the next round of leader 
election as well.
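
To spell out the race I have in mind, here is a tiny, purely hypothetical illustration (not ZooKeeper code; the AtomicReference vote holder and all names are assumptions) of the interleaving between the election restart resetting the vote to self and the follower reading that vote to locate the leader:
{code:java}
// Purely hypothetical illustration of the suspected race -- not the actual
// ZooKeeper implementation.
import java.util.concurrent.atomic.AtomicReference;

class VoteRaceSketch {

    record Vote(long leaderSid) {}

    static final long MY_SID = 1;
    static final long REAL_LEADER_SID = 2;

    // Shared current vote, updated by the election and read by the follower.
    static final AtomicReference<Vote> currentVote =
            new AtomicReference<>(new Vote(REAL_LEADER_SID));

    // "Restarting leader election" path: resets the vote back to ourselves.
    static void restartLeaderElection() {
        currentVote.set(new Vote(MY_SID));
    }

    // Follower entering "following - discovery": reads the vote to decide
    // which server.N entry to connect to.
    static long leaderSidSeenByFollower() {
        return currentVote.get().leaderSid();
    }

    public static void main(String[] args) {
        // If the reset lands after the election result but before the follower
        // reads the vote, the follower sees sid 1 (itself) and then connects to
        // its own server.1 entry -- 0.0.0.0:2888 in this setup.
        restartLeaderElection();
        System.out.println("follower will look up sid " + leaderSidSeenByFollower());
    }
}
{code}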

 

*The configuration and setup:*
 # 2 ZooKeeper nodes
 # On each ZooKeeper node, we set the node's own address to 0.0.0.0 to work 
around the slow DNS issue in Kubernetes (see ZOOKEEPER-4708). That is,
For node 1, we have:

{code:java}
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181{code}
For node 2, we have:
{code:java}
server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181 {code}
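For reference, the server.N value format used above is host:quorumPort:electionPort[:role];[clientAddress:]clientPort. Below is a minimal parsing sketch of such an entry; the class and field names are mine for illustration (ZooKeeper's own parsing lives in QuorumPeer.QuorumServer):
{code:java}
// Minimal parsing sketch for a server.N value -- illustrative only, not
// ZooKeeper's parser.
class ServerEntrySketch {

    record ServerEntry(String host, int quorumPort, int electionPort,
                       String role, String clientAddress) {}

    // Expected shape: host:quorumPort:electionPort[:role];clientAddress:clientPort
    static ServerEntry parse(String value) {
        String[] halves = value.split(";", 2);   // quorum part ; client part
        String[] quorum = halves[0].split(":");
        return new ServerEntry(
                quorum[0],                                       // e.g. 0.0.0.0 on the node itself
                Integer.parseInt(quorum[1]),                     // 2888: quorum port
                Integer.parseInt(quorum[2]),                     // 3888: election port
                quorum.length > 3 ? quorum[3] : "participant",   // role, defaults to participant
                halves.length > 1 ? halves[1] : "");             // e.g. 127.0.0.1:12181
    }

    public static void main(String[] args) {
        // Node 1's own entry: the quorum host is literally 0.0.0.0, so any leader
        // lookup on node 1 that resolves to sid 1 yields 0.0.0.0:2888.
        System.out.println(parse("0.0.0.0:2888:3888:participant;127.0.0.1:12181"));
    }
}
{code}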
Logs:

[zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
[zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]

> follower can't connect to the right leader and quorum failed to form
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4724
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.4
>            Reporter: Luke Chen
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
