[
https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390949#comment-15390949
]
Arshad Mohammad commented on ZOOKEEPER-2172:
--------------------------------------------
We also faced this issue.
The problem occurs when the reconfig's PROPOSAL and COMMITANDACTIVATE arrive
between the snapshot and the UPTODATE packet.
The following steps reproduce the issue very easily:
# start a three-server zookeeper cluster; let's say the servers are server-1,
server-2, server-3
# create a large amount of data in zookeeper, around 150 MB
# install the fourth server, server-4, and add the server information of all
four servers to server-4's configuration
{code}
server.1=192.168.1.3:2888:3888:participant
server.2=192.168.1.3:2889:3889:participant
server.3=192.168.1.3:2890:3890:participant
server.4=192.168.1.2:2890:3890:participant
{code}
# connect a client to any of the existing servers
# start server-4, then immediately run the reconfig command from the already
connected client.
{{reconfig -add server.4=192.168.1.2:2890:3890:participant;2181}}
# open the zookeeper/conf folder; you will find zoo.cfg.dynamic.next and the
existing quorum dynamic configuration file zoo.cfg.dynamic.100000000
zoo.cfg.dynamic.next --> this has information of all the servers
zoo.cfg.dynamic.100000000 --> this has information of only the existing servers
server-1,server-2,server-3
# Even though server-4 started and joined the quorum, if you try to restart it,
it will fail with the following errors
{code}
2016-07-24 11:00:11,689 [myid:4] - ERROR [main:QuorumPeerMain@98] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 4 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:748)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:183)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:120)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
{code}
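The restart failure in the last step can be illustrated with a small Python sketch. This is not ZooKeeper's code: the helper names server_ids and can_restart are made up for illustration, and the only assumption is that peers are declared on "server.N=..." lines as in the configuration above. It shows why a server whose id appears only in zoo.cfg.dynamic.next, and not in the active dynamic file, would not find itself in the peer list.

```python
import re

# Matches dynamic-config lines such as
#   server.4=192.168.1.2:2890:3890:participant
SERVER_LINE = re.compile(r"^server\.(\d+)=")


def server_ids(config_text):
    """Return the set of server ids declared in a dynamic config text."""
    ids = set()
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return ids


def can_restart(my_id, config_text):
    """Hypothetical check: is my_id part of the peer list in this config?"""
    return my_id in server_ids(config_text)


# Contents modelled on the two files from the reproduction steps.
active = (
    "server.1=192.168.1.3:2888:3888:participant\n"
    "server.2=192.168.1.3:2889:3889:participant\n"
    "server.3=192.168.1.3:2890:3890:participant\n"
)
pending = active + "server.4=192.168.1.2:2890:3890:participant\n"

print(can_restart(4, active))   # False: id 4 is missing from the active file
print(can_restart(4, pending))  # True: id 4 exists only in the .next file
```

With the active file listing only servers 1 to 3, the check fails for id 4, which matches the "My id 4 not in the peer list" error above.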
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-2172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.0
> Environment: Ubuntu 12.04 + java 7
> Reporter: Ziyou Wang
> Assignee: Arshad Mohammad
> Priority: Critical
> Fix For: 3.5.3
>
> Attachments: ZOOKEEPER-2172.patch, history.txt, node-1.log,
> node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log,
> zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log,
> zoo-3-2.log, zoo-3-3.log, zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log,
> zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next, zookeeper-1.log,
> zookeeper-1.out, zookeeper-2.log, zookeeper-2.out, zookeeper-3.log,
> zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then
> reconfig the cluster to add the new one as a participant. When I add the
> third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log.
> So the first node received the reconfig cmd at 12:53:48. Later, it logged
> “2015-04-20 12:53:52,230 [myid:1] - ERROR
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1]
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
> /10.0.0.2:55890 ********”. From then on, the first node and second node
> rejected all client connections and the third node didn’t join the cluster as
> a participant. The whole cluster was down.
>
> When the problem happened, all three nodes just used the same dynamic
> config file zoo.cfg.dynamic.10000005d which only contained the first two
> nodes. But there was another unused dynamic config file in node-1 directory
> zoo.cfg.dynamic.next which already contained three nodes.
>
> When I extended the waiting time between starting the third node and
> reconfiguring the cluster, the problem didn’t show again. So it should be a
> race condition problem.
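The stuck state described in the quoted report (an active dynamic file with the old membership next to a zoo.cfg.dynamic.next with the new one) can be spotted by comparing the two files. The Python sketch below is only an illustration under assumptions: the function names are hypothetical, the addresses in the example are invented, and a .next file can also exist briefly during a healthy reconfig, so a difference only indicates a reconfig that has not been committed yet.

```python
def parse_servers(config_text):
    """Map server id (as a string) to its spec from 'server.N=spec' lines."""
    servers = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("server.") and "=" in line:
            key, value = line.split("=", 1)
            servers[key[len("server."):]] = value
    return servers


def uncommitted_servers(active_text, next_text):
    """Return ids present in the pending (.next) config but not in the active one."""
    active = parse_servers(active_text)
    pending = parse_servers(next_text)
    return sorted(sid for sid in pending if sid not in active)


# Invented example: two servers in the active file, three in the .next file.
active_cfg = (
    "server.1=10.0.0.1:2888:3888:participant\n"
    "server.2=10.0.0.2:2888:3888:participant\n"
)
next_cfg = active_cfg + "server.3=10.0.0.3:2888:3888:participant\n"

print(uncommitted_servers(active_cfg, next_cfg))  # ['3']
```

If the returned list stays non-empty long after the reconfig command was issued, the new membership was proposed but never activated, which is the situation described for both the three-node and the four-node case.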