[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167133#comment-17167133
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3902:
---------------------------------------------

I think ZooKeeper 3.4 is about to reach EOL by now, so I don't expect many 
people will investigate this issue.

Without going deep, the symptoms you describe remind me of a known issue in 
older ZooKeeper versions, one I have seen many times on production and test 
clusters. Usually what happens is that a security port scanner or a 
misconfigured script / application connects to the ZooKeeper leader election 
port (3888 by default) and sends random bytes there, which causes the 
QuorumCnxManager thread in ZooKeeper to die with an unhandled message parsing 
exception. From that point on, the node can no longer participate in leader 
election, which leaves the other ZooKeeper servers unable to connect to it, 
or in extreme cases prevents any new quorum from forming. A rolling restart 
of the cluster usually fixes this problem, and upgrading to a newer version 
also helps (e.g. ZOOKEEPER-2029 was fixed only in 3.5).
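To illustrate the failure mode, here is a toy sketch (in Python, not the real 
ZooKeeper Java code; all names are illustrative): a receiver that trusts a 
length prefix read from the wire will blow up on a scanner's random bytes, and 
if that exception is unhandled, the receiving thread dies.

```python
# Toy illustration (NOT real ZooKeeper code) of why random bytes on the
# election port can kill a naive receiver thread: the receiver trusts a
# length prefix from the wire, and garbage input makes parsing blow up.
import struct

MAX_SANE_LENGTH = 512 * 1024  # defensive cap, the kind newer versions add

def parse_election_message(data: bytes, defensive: bool = False) -> bytes:
    """Parse one length-prefixed message; 'data' stands in for the socket stream."""
    if len(data) < 4:
        raise ValueError("truncated header")
    (length,) = struct.unpack(">i", data[:4])
    if defensive and not (0 < length <= MAX_SANE_LENGTH):
        raise ValueError(f"rejecting insane message length {length}")
    if length < 0:
        # A port scanner's random bytes can decode to a negative length;
        # without a guard, this kind of error surfaces as an unhandled
        # exception that kills the receiver thread.
        raise MemoryError(f"cannot allocate buffer of size {length}")
    return data[4:4 + length]

# Random junk from a scanner: the first 4 bytes decode to a negative int.
junk = b"\xff\xff\xff\xf0hello"
try:
    parse_election_message(junk)
except MemoryError as e:
    print("receiver thread would die:", e)
```

The fix in newer versions is essentially the `defensive=True` path: validate 
the claimed length before allocating, and keep the listener alive when a 
single connection sends garbage.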

I'm not sure this is the problem in your case; to make sure, we would need to 
check all the logs, possibly going back to the time zookeeper-4.us-west.com 
became leader. It is also possible you were hitting a completely different bug 
(e.g. the socket hung for some reason, which I remember seeing in Kubernetes 
deployments).

Anyway, I think it is a good idea to upgrade to a more recent stable version 
(3.5.8 or 3.6.1). A lot of bugfixes are included in those releases.
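As a quick diagnostic, the `stat` four-letter command (available in 3.4) 
reports each server's current role, which helps spot a node that no longer 
knows the leader. A minimal sketch; the hostnames are placeholders for your 
ensemble:

```python
# Hedged sketch: poll each ensemble member with the 'stat' four-letter
# command and report its Mode (leader/follower/standalone) to spot a
# node that is stuck. Hostnames below are placeholders.
import socket

def four_letter_word(host: str, port: int, cmd: str = "stat",
                     timeout: float = 2.0) -> str:
    """Send a four-letter command and return the raw text response."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd.encode())
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

def extract_mode(stat_output: str) -> str:
    """Pull the 'Mode:' value out of a 'stat' response."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    for host in ["zookeeper-2.us-west.com", "zookeeper-4.us-west.com"]:
        try:
            print(host, "->", extract_mode(four_letter_word(host, 2181)))
        except OSError as e:
            print(host, "-> unreachable:", e)
```

A healthy ensemble shows exactly one `leader` and the rest `follower`; a node 
that answers client traffic but reports no usable Mode, or refuses the 
connection, is a candidate for the dead-election-thread symptom above.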

> ZooKeeper cluster did not perform leader reelection when leader host crashed
> ----------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3902
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3902
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.13
>            Reporter: Rajan Dhabalia
>            Priority: Major
>
> I saw this issue in one of our ZooKeeper clusters: the LEADER host crashed 
> due to a hardware issue, and all the follower hosts immediately closed their 
> client connections with the leader host, but the zk-cluster did not trigger 
> leader reelection until we manually restarted a few ZooKeeper hosts some 
> time later.
> Snapshot size: 200MB
> ZooKeeper version: 3.4.13
> The logs below are from the Leader and one of the follower hosts, captured 
> when the Leader host crashed.
> 1. Leader host: zookeeper-4.us-west.com crashed at 16:21:09, leaving no 
> application logs.
> 2. Follower host: zookeeper-2.us-west.com immediately closed its 
> client-connection with zookeeper-4.us-west.com at 16:21:09.
> 3. None of the Follower hosts triggered LEADER reelection, and there are no 
> such logs.
> 4. All the zk-clients started getting ZooKeeper session timeouts.
> 5. After 27 minutes, at 16:48:30, a Follower host detected the issue in the 
> quorum and logged 
> *"Have smaller server identifier, so dropping the connection"*
> 6. But it still did not trigger leader reelection until we manually 
> restarted the zk-process on a few zk hosts 5 minutes later.
> 7. Finally, after manually restarting the process on a few zk hosts, the zk 
> cluster performed leader reelection and formed a quorum.
> *Logs:*
> *Leader zookeeper: zookeeper-4.us-west.com*
> {noformat}
> Leader zookeeper: zookeeper-4.us-west.com
> Host crashed at 16:21:10,095 (its last log line), then came back an hour 
> later at `17:25:27` and rejoined as FOLLOWER.
> 16:21:09,965 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - 
> Accepted socket connection from /0.0.0.1:45186
> 16:21:09,965 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing 
> mntr command from /0.0.0.1:45186
> 16:21:10,095 - INFO [Thread-194783:NIOServerCnxn@1056] - Closed socket 
> connection for client /0.0.0.1:45186 (no session established for client)
> 17:25:27,262 - INFO [main:QuorumPeerConfig@136] - Reading configuration from: 
> {noformat}
>  
> *Follower zookeeper: zookeeper-2.us-west.com*
> {noformat}
> 16:21:09,743 - INFO [Thread-78288:NIOServerCnxn@1056] - Closed socket 
> connection for client /4.4.4.4:52216 (zookeeper-4.us-west.com-IP) (no session 
> established for client)
> :
> # The ZK host didn't trigger LEADER reelection even though 
> zookeeper-4.us-west.com crashed and the connection to it was lost at 
> `16:21:09,743`
> # Clients kept connecting and disconnecting due to zk session timeouts
> 16:48:29,961 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - 
> Accepted socket connection from /0.0.0.2:43938
> 16:48:29,965 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client 
> attempting to establish new session at /0.0.0.2:43938
> 16:48:29,977 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - 
> Accepted socket connection from /0.0.0.3:54630
> 16:48:29,981 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client 
> attempting to establish new session at /0.0.0.3:54630
> :
> # After 27 minutes: saw the first log from QuorumCnxManager, as the zk host 
> was figuring out whether leader reelection was needed
> 16:48:30,475 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have 
> smaller server identifier, so dropping the connection: (3, 2)
> 16:48:30,475 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - 
> Notification: 1 (message format version), 2 (n.leader), 0x6d00328f73 
> (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x6d (n.peerEpoch) 
> LOOKING (my state)
> 16:48:30,476 - INFO 
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing 
> mntr command from /2.2.2.2:55136
> 16:48:30,476 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - 
> Notification: 1 (message format version), 4 (n.leader), 0x6c00b36a5a 
> (n.zxid), 0x35e (n.round), LOOKING (n.state), 1 (n.sid), 0x6c (n.peerEpoch) 
> LOOKING (my state)
> 16:48:30,476 - INFO 
> [prod1-zk2.messaging.gq1.yahoo.com/2.2.2.2:2183:QuorumCnxManager$Listener@743]
>  - Received connection request /3.3.3.3:34092
> 16:48:30,476 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have 
> smaller server identifier, so dropping the connection: (4, 2)
> :
> # After manually restarting the follower host, it came back and rejoined:
> 16:50:53,986 - INFO [main:QuorumPeerConfig@398] - Defaulting to majority 
> quorums
> 16:50:53,991 - INFO [main:DatadirCleanupManager@78] - 
> autopurge.snapRetainCount set to 3
> 16:50:53,991 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval 
> set to 1
> 16:50:53,992 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge 
> task started.
> 16:50:54,003 - INFO [main:QuorumPeerMain@130] - Starting quorum peer
> 16:50:54,026 - INFO [main:FileSnap@86] - Reading snapshot
> :
> 16:55:04,377 - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@65] - 
> FOLLOWING - LEADER ELECTION TOOK - 233617
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
