[jira] [Updated] (ZOOKEEPER-4816) A follower can not join the cluster for 20s seconds

mutu (Jira) Mon, 01 Jul 2024 02:14:54 -0700


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


mutu updated ZOOKEEPER-4816:
----------------------------
    Description: 
We encountered a strange scenario. When we set up a ZooKeeper cluster (3 nodes
in total), the third node got stuck in ({*}sealStream{*}) while serializing the
snapshot to the local disk. Leader election nevertheless ran normally, and the
third node was elected leader. The other two nodes failed to connect to that
leader, so the first and second nodes restarted leader election, and the second
node was finally elected leader. At that point the third node was still acting
as leader, so there were two leaders in the cluster. The first node could not
join the cluster for 20 seconds.

The logs of the first node are as follows.

```
2024-03-12 07:20:51,552 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2024-03-12 07:20:51,565 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2024-03-12 07:20:51,594 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2024-03-12 07:20:51,608 [myid:] - INFO [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391] - Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2024-03-12 07:20:51,608 [myid:] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@1205] - Oracle indicates not to follow
```
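
In the excerpt above, node 3 announces LEADING with n.round:0x1 while nodes 1
and 2 are already at n.round:0x2. A minimal Python sketch for pulling such
entries out of the attached logs could look like the following. The function
names are hypothetical helpers, the regular expression assumes the field order
shown in the excerpt, and treating an older round as "stale" is only a
heuristic for inspecting logs, not ZooKeeper's own election logic.

```python
import re

# Field order assumed from the "Notification:" lines quoted in this issue.
NOTIFICATION = re.compile(
    r"n\.sid:(?P<sid>\d+), n\.state:(?P<state>\w+), n\.leader:(?P<leader>\d+), "
    r"n\.round:(?P<round>0x[0-9a-fA-F]+)"
)


def parse_notifications(log_text):
    """Return one dict per election notification found in log_text."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        {
            "sid": int(m["sid"]),
            "state": m["state"],
            "leader": int(m["leader"]),
            "round": int(m["round"], 16),
        }
        for m in NOTIFICATION.finditer(flat)
    ]


def stale_leaders(notifications):
    """Return LEADING notifications sent in a round older than the newest seen.

    Heuristic only (hypothetical helper): it flags entries worth a closer look.
    """
    if not notifications:
        return []
    newest = max(n["round"] for n in notifications)
    return [
        n for n in notifications
        if n["state"] == "LEADING" and n["round"] < newest
    ]


# Notification fields copied from the node 1 log excerpt in this issue.
SAMPLE = """
Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0
Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0
"""

if __name__ == "__main__":
    for entry in stale_leaders(parse_notifications(SAMPLE)):
        print(entry)
```

Running the same two functions over node1.log, node2.log and node3.log would
show how long node 3 keeps advertising itself as leader from the older round.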

During this period, clients could not connect to any node in the cluster.

Runtime logs are attached.

Is the root cause that serializing the snapshot blocks the state change of the
third node?

Any comments that help figure out this issue would be much appreciated.



> A follower can not join the cluster for 20s seconds
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-4816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Critical
>         Attachments: node1.log, node2.log, node3.log



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
