[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558059#comment-14558059 ]
Michi Mutsuzaki commented on ZOOKEEPER-2172:
--------------------------------------------

Thanks Ziyou. [~fpj] [~shralex] It looks like node1 and node2 are not forming a quorum because node2 has seen zxid 0x100000059, but node1 keeps sending 0x0 as its zxid. Isn't node1 supposed to send the highest zxid it has seen?

From zookeeper-1.log:

{noformat}
2015-05-25 12:34:36,920 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 100000049
2015-05-25 12:34:39,090 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 100000049
2015-05-25 12:35:28,128 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 100000049
2015-05-25 12:35:30,301 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 100000049
{noformat}

From zookeeper-2.log:

{noformat}
2015-05-25 12:34:36,918 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)100000049 (n.config version)
2015-05-25 12:34:36,923 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)100000049 (n.config version)
2015-05-25 12:35:28,124 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
2015-05-25 12:35:28,125 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x100000059 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
{noformat}
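For context, election votes are totally ordered by (epoch, zxid, server id), so a peer that advertises zxid=0x0 can never outrank one that has seen 0x100000059. A rough sketch of that comparison rule, modeled on FastLeaderElection.totalOrderPredicate (simplified here, not the exact upstream code):

{code:java}
// Simplified sketch of the vote ordering used during leader election
// (modeled on FastLeaderElection.totalOrderPredicate; not the exact
// upstream implementation). A new vote wins only on a higher epoch,
// then a higher zxid, then a higher server id.
static boolean newVoteWins(long newEpoch, long newZxid, long newId,
                           long curEpoch, long curZxid, long curId) {
    if (newEpoch != curEpoch) {
        return newEpoch > curEpoch;
    }
    if (newZxid != curZxid) {
        return newZxid > curZxid;
    }
    return newId > curId;
}

// With node1 advertising (epoch=0x1, zxid=0x0, id=1) and node2 holding
// (epoch=0x1, zxid=0x100000059, id=2), node1's notification never wins
// this comparison, so node2 keeps voting for itself and the two servers
// never converge on a leader.
{code}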
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Priority: Critical
>         Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log
>
> The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log, so the first node received the reconfig command at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE /10.0.0.2:55890 ********”. From then on, the first and second nodes rejected all client connections, and the third node didn’t join the cluster as a participant. The whole cluster was down.
>
> When the problem happened, all three nodes were using the same dynamic config file, zoo.cfg.dynamic.10000005d, which only contained the first two nodes. But there was another, unused dynamic config file in the node-1 directory, zoo.cfg.dynamic.next, which already contained all three nodes.
>
> When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show up again, so it looks like a race condition.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)