[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547490#comment-14547490 ]
Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

Server 1 continues to lead until the end of the execution. It is able to push its version to server 3, but for some reason its leader election messages carry round 0xffffffffffffffff, which I think is why servers 2 and 3 don't adopt it as leader. It also doesn't time out, for some reason. [~fpj], this looks related to ZOOKEEPER-1732 and ZOOKEEPER-1805, any thoughts?

Unlike what the description says, the .next file provided here is identical to the other config file and contains a config with servers 1 and 2, so it probably resulted from the reconfiguration that added server 2. Which server is this file coming from? Server 2, probably? That can happen if server 1 committed the reconfig but server 2 hasn't learned of the commit yet (although its other config file would have to be different in that case).

It would be helpful if you could reproduce the scenario without the ZK-2031 patch and provide all the config files from the servers, including the initial ones, plus all the reconfig commands you run and when you run them. I can see from the logs that there were many attempts to reconfigure (probably to add server 2) before it had synced with server 1, so they failed, which is normal. Then a reconfig succeeds at 12:51:48, after which more reconfig commands are invoked (e.g. at 12:51:56), which is before server 3 even starts. Is this intentional? What do these commands attempt to do?

> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Priority: Critical
>         Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next
>
>
> The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log, so the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE /10.0.0.2:55890 ********”. From then on, the first and second nodes rejected all client connections and the third node didn't join the cluster as a participant. The whole cluster was down.
>
> When the problem happened, all three nodes were using the same dynamic config file, zoo.cfg.dynamic.10000005d, which only contained the first two nodes. But there was another, unused dynamic config file in the node-1 directory, zoo.cfg.dynamic.next, which already contained all three nodes.
>
> When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn't show up again, so it should be a race condition.
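
For reference, the kind of incremental reconfig discussed above (adding the third server as a participant) can be issued from the Java client roughly as sketched below. This is an illustrative example only, not the reporter's actual commands: the 10.0.0.x addresses, ports, and session timeout are assumptions taken from the attached logs, and the call shown is the ZooKeeper#reconfig API as exposed by the 3.5.0 client.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of an incremental reconfig that adds server 3 as a
// participant. Addresses and ports are assumptions based on the attached logs.
public class AddParticipantSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("10.0.0.1:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) { /* no-op for this sketch */ }
        });
        try {
            // The joining string uses the same server-spec syntax as a line in
            // the dynamic config file (zoo.cfg.dynamic.*).
            String joining = "server.3=10.0.0.3:2888:3888:participant;2181";
            byte[] newConfig = zk.reconfig(joining, null, null, -1, null);
            System.out.println("new config: " + new String(newConfig, "UTF-8"));
        } catch (KeeperException e) {
            // Fails with e.g. NEWCONFIGNOQUORUM or RECONFIGINPROGRESS if the
            // joiner has not yet synced with the leader, as seen in the logs.
            System.err.println("reconfig failed: " + e.code());
        } finally {
            zk.close();
        }
    }
}
{code}

The zkCli.sh equivalent would be something like: reconfig -add "server.3=10.0.0.3:2888:3888:participant;2181".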