[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572133#comment-14572133 ]
Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

Server 1:

2015-06-03 17:15:27,692 [myid:1] - DEBUG [ProcessThread(sid:1 cport:-1)::QuorumCnxManager@365] - Opening channel to server 3
2015-06-03 17:15:27,710 [myid:1] - DEBUG [ProcessThread(sid:1 cport:-1)::QuorumCnxManager@371] - Connected to server 3
...
2015-06-03 17:15:31,649 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:50710:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
2015-06-03 17:15:31,650 [myid:1] - WARN [LearnerHandler-/10.0.0.2:50710:LearnerHandler@595] - ******* GOODBYE /10.0.0.2:50710 ********

Server 3:

2015-06-03 17:15:27,702 [myid:3] - INFO [/10.0.0.3:1200:QuorumCnxManager$Listener@556] - Received connection request /10.0.0.1:39184
2015-06-03 17:15:31,718 [myid:3] - WARN [/10.0.0.3:1200:QuorumCnxManager@268] - Exception reading or writing challenge: java.net.SocketTimeoutException: Read timed out

It seems that server 3 is still in the initial connection phase while server 1 has already passed it and wants to start working (it fails on the qp = new QuorumPacket() line). It may be helpful to log everything server 3 receives from server 1 before it crashes in QuorumCnxManager.receiveConnection, exactly which execution path is taken, etc. You could also print server 1's side in LearnerHandler.java. Also, are you sure that the ZooKeeper distribution on these two servers is the same?
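For example (a debug-only sketch: this class is not part of ZooKeeper, and its name and the System.err logging are made up), you could wrap the connection's InputStream before it is handed to the DataInputStream in QuorumCnxManager.receiveConnection, and similarly around the stream LearnerHandler reads from on server 1's side, to dump the raw bytes each peer actually sees:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Debug-only wrapper (hypothetical, not in ZooKeeper): logs every byte
// read from a quorum connection, so we can see exactly what server 3
// receives from server 1 before the read times out.
public class LoggingInputStream extends FilterInputStream {
    private final String peer;

    public LoggingInputStream(InputStream in, String peer) {
        super(in);
        this.peer = peer;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            System.err.printf("[%s] byte: 0x%02x%n", peer, b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            StringBuilder hex = new StringBuilder();
            for (int i = off; i < off + n; i++) {
                hex.append(String.format("%02x ", buf[i]));
            }
            System.err.printf("[%s] %d bytes: %s%n", peer, n, hex);
        }
        return n;
    }
}

Hooking it in would be a one-line change at each read site, e.g. new DataInputStream(new LoggingInputStream(sock.getInputStream(), sock.getRemoteSocketAddress().toString())). That would show whether server 1 ever sends the initial sid/challenge bytes and exactly where the stream stalls.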
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Priority: Critical
>         Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log,
>                      zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log,
>                      zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log,
>                      zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next,
>                      zookeeper-1.log, zookeeper-2.log, zookeeper-3.log
>
> The operations are quite simple: start three zk servers one by one, then
> reconfig the cluster to add the new one as a participant. When I add the
> third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1
> log, so the first node received the reconfig cmd at 12:53:48. Later, it
> logged “2015-04-20 12:53:52,230 [myid:1] - ERROR
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1]
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
> /10.0.0.2:55890 ********”. From then on, the first and second nodes rejected
> all client connections and the third node didn’t join the cluster as a
> participant. The whole cluster was down.
>
> When the problem happened, all three nodes were using the same dynamic
> config file, zoo.cfg.dynamic.10000005d, which contained only the first two
> nodes. But there was another, unused dynamic config file in the node-1
> directory, zoo.cfg.dynamic.next, which already contained all three nodes.
>
> When I extended the waiting time between starting the third node and
> reconfiguring the cluster, the problem didn’t show up again, so it is likely
> a race condition.
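For anyone trying to reproduce the quoted report, the reconfig step it describes can be issued from the Java client. A minimal sketch, assuming the 3.5.0 client API (ZooKeeper#reconfig; later 3.5.x releases moved it to ZooKeeperAdmin) and made-up addresses/ports for the three servers:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class AddThirdParticipant {
    public static void main(String[] args) throws Exception {
        // Connect to one of the existing ensemble members (address assumed).
        ZooKeeper zk = new ZooKeeper("10.0.0.1:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) { /* not used here */ }
        });
        try {
            // Incremental reconfig: add server 3 as a participant.
            // A real tool would wait for the session to connect first.
            String joining = "server.3=10.0.0.3:2888:3888:participant;2181";
            // fromConfig = -1: don't condition on a specific config version.
            byte[] newConfig = zk.reconfig(joining, null, null, -1, new Stat());
            System.out.println(new String(newConfig, "UTF-8"));
        } finally {
            zk.close();
        }
    }
}

The same operation from zkCli.sh would be something like: reconfig -add server.3=10.0.0.3:2888:3888:participant;2181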