[jira] [Updated] (ZOOKEEPER-4816) A follower can not join the cluster for 30s seconds

mutu (Jira) Wed, 10 Jul 2024 19:20:04 -0700


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


mutu updated ZOOKEEPER-4816:
----------------------------
    Component/s: leaderElection

> A follower can not join the cluster for 30s seconds
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-4816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Critical
>         Attachments: system1_FLE.log, system1_hang.log, system2_FLE.log, 
> system2_hang.log, system3_FLE.log, system3_hang.log
>
>
> We encounter a strange scenario. When we set up the cluster of zookeeper(3 
> nodes totally), the third node is stuck in ({*}sealStream{*}) serializing the 
> snapshot to the local disk. However, the leader election is executed 
> normally. After the election, the third node is elected as the leader. The 
> other two nodes fail to connect with the leader. Hence, the first and second 
> nodes restart the leader election, finally the second node is elected as the 
> leader. At this time, the third node still act as the leader. There are two 
> leaders in the cluster. The first node can not join the cluster for 30s. 
> The logs of the first node are as following.
> {code:java}
> 2024-07-01 02:35:28,223 [myid:] - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, 
> n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:35:28,253 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@996]
>  - Notification time out: 12800 ms2024-07-01 02:35:28,254 [myid:] - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:2, 
> n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:35:28,287 [myid:] - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, 
> n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:35:28,288 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@1205]
>  - Oracle indicates not to follow2024-07-01 02:35:29,493 [myid:] - WARN  
> [NIOWorkerThread-20:o.a.z.s.NIOServerCnxn@397] - Close of session 0x0{code}
> During this procedure, the client can not connect with any nodes of the 
> cluster.
> Runtime logs are attached.
> The root cause is the serializing the snapshot blocks the status modification 
> of the third node?
>  
> Sometimes, this phenomenon will disappears. The configuration is the same as 
> above. The second node is stuck in sealStream due to fail-slow disk.
> {code:java}
> // node1
> 2024-07-01 02:19:21,942 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@920]
>  - Peer state changed: looking2024-07-01 02:19:21,943 [myid:] - WARN  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@1602]
>  - PeerState set to LOOKING2024-07-01 02:19:21,943 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@1471]
>  - LOOKING2024-07-01 02:19:21,943 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@946]
>  - New election. My id = 1, proposed zxid=0x02024-07-01 02:19:21,980 [myid:] 
> - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, 
> n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,012 [myid:] - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:2, 
> n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,026 [myid:] - INFO  
> [WorkerReceiver[myid=1]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:2, n.state:LEADING, n.leader:2, 
> n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,026 [myid:] - INFO  
> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@906]
>  - Peer state changed: following {code}
> {code:java}
> // node3
> 2024-07-01 02:19:22,780 [myid:] - INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@920]
>  - Peer state changed: looking2024-07-01 02:19:22,780 [myid:] - WARN  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@1602]
>  - PeerState set to LOOKING2024-07-01 02:19:22,780 [myid:] - INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@1471]
>  - LOOKING2024-07-01 02:19:22,781 [myid:] - INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FastLeaderElection@946]
>  - New election. My id = 3, proposed zxid=0x02024-07-01 02:19:22,819 [myid:] 
> - INFO  
> [WorkerReceiver[myid=3]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:3, n.state:LOOKING, n.leader:3, 
> n.round:0x2, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,838 [myid:] - INFO  
> [WorkerReceiver[myid=3]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:1, n.state:FOLLOWING, n.leader:2, 
> n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,849 [myid:] - INFO  
> [WorkerReceiver[myid=3]:o.a.z.s.q.FastLeaderElection$Messenger$WorkerReceiver@391]
>  - Notification: my state:LOOKING; n.sid:2, n.state:LEADING, n.leader:2, 
> n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, 
> n.config version:0x02024-07-01 02:19:22,850 [myid:] - INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@906]
>  - Peer state changed: following {code}
> But the leader election notification is strange.  Node1 follows node2 since 
> node3 follows node2. However, Node3 follows node2 since node1 follows node2. 
> This condition cannot stand.
> Moreover, this two scenarios are both stuck in sealStream. But there are 
> different results. Is there a data race?
> Are there any comments to figure out this issues？ I will very appreciate them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (ZOOKEEPER-4816) A follower can not join the cluster for 30s seconds

Reply via email to