[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986329#comment-15986329
 ] 

Patrick Kleindienst commented on ZOOKEEPER-2766:
------------------------------------------------

I kept the defaults for initLimit and syncLimit in that case. When downgrading 
to ZooKeeper 3.4 leader election works perfectly (also using the defaults for 
that attributes). However, checking that by increasing these values might be 
reasonable as a next step I guess.

> Quorum fails with java.io.EOFException
> --------------------------------------
>
>                 Key: ZOOKEEPER-2766
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.5.3
>         Environment: CentOS-7, Docker version 17.03.1-ce
>            Reporter: Patrick Kleindienst
>         Attachments: logs-node1.txt, logs-node2.txt, logs-node3.txt
>
>
> When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing 
> the following behavior:
> Two nodes (let's say node 2 and 3) out of the three start their own quorum, 
> and finally one of them is elected the new leader (node 3), while the other 
> one becomes the follower (node 2). Node 1 seems to be able to establish a 
> connection to node 3 (elected leader) in my case, but this seems to fail for 
> node 2. 
> Node 1 shows the following in its logs:
>  2017-04-25 09:24:02,806 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055]
>  - LOOKING
> 2017-04-25 09:24:02,808 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894]
>  - New election. My id =  1, proposed zxid=0x0
> 2017-04-25 09:24:02,811 [myid:1] - INFO  
> [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:24:02,817 [myid:1] - WARN  
> [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at 
> election address /9.152.171.98:3888
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-04-25 09:24:02,822 [myid:1] - INFO  
> [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, 
> so dropping the connection: (3, 1)
> 2017-04-25 09:24:03,025 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457]
>  - Cannot open channel to 2 at election address /9.152.171.98:3888
> However, that's not all, since the quorum consisting of node 2 and 3 does not 
> work properly. The nodes' logs tell that leader election between these two 
> works fine.
> Here's what node 3 (leader) says:
>  2017-04-25 09:09:33,842 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:33,844 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:33,851 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:34,051 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
>  - Unregister MBean 
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
> 2017-04-25 09:09:34,052 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143]
>  - LEADING
> 2017-04-25 09:09:34,055 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] 
> - TCP NoDelay set to: true
> 2017-04-25 09:09:34,055 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] 
> - zookeeper.leader.maxConcurrentSnapshots = 10
> 2017-04-25 09:09:34,056 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] 
> - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
> And here's the output node 2 (follower) provides:
> 2017-04-25 09:09:31,875 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:32,077 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
>  - Unregister MBean 
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
> ica.2,name2=LeaderElection]
> 2017-04-25 09:09:32,077 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131]
>  - FOLLOWING
> 2017-04-25 09:09:32,082 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] 
> - TCP NoDelay set to: true
> So far, so good. But seconds later the connection between node 2 and 3 seems 
> to get lost, causing leader node 3 to report an EOFExeption. If I understand 
> the logs correctly, node 2 (follower) properly closes the connection (sending 
> "Goodbye"), whilst node 3 says that the socket is still open. 
> 2017-04-25 09:09:34,190 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] 
> - LEADING - LEADER ELECTION TOOK - 138 MS
> 2017-04-25 09:09:34,197 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320]
>  - Snapshotting: 0x0 to /data/version-2/snapshot.0
> 2017-04-25 09:09:35,076 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : 
> info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
> 2017-04-25 09:09:35,113 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with 
> Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 
> peerLastZxid=0x
> 0
> 2017-04-25 09:09:35,114 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF 
> zxid=0x0 for peer sid: 2
> 2017-04-25 09:09:35,133 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258]
>  - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and 
> setting last processed zxid: 0x100000000
> 2017-04-25 09:09:35,169 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255]
>  - Configuring CommitProcessor with 2 worker threads.
> 2017-04-25 09:09:35,179 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64]
>  - Using checkIntervalMs=60000 maxPerMinute=10000
> 2017-04-25 09:09:35,196 [myid:3] - WARN  
> [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, 
> my id = 3, error = 
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> 2017-04-25 09:09:35,196 [myid:3] - WARN  
> [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
> 2017-04-25 09:09:35,197 [myid:3] - WARN  
> [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:35,197 [myid:3] - WARN  
> [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread  
> id 2 my id = 3
> 2017-04-25 09:09:35,204 [myid:3] - ERROR 
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected 
> exception causing shutdown while sock still open
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> 2017-04-25 09:09:35,204 [myid:3] - WARN  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE 
> /9.152.171.98:51328 ********
> 2017-04-25 09:09:37,181 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] 
> - Shutting down
> 2017-04-25 09:09:37,182 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] 
> - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers 
> synced, only synced with sids: [ [3] ]
>         at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)
> Unfortunately, node 2 does not provide any additional information on what 
> exactly is going on. After leader election, the only thing it reports is this:
> 2017-04-25 09:09:32,091 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68]
>  - FOLLOWING - LEADER ELECTION TOOK - 13 MS
> 2017-04-25 09:09:32,094 [myid:2] - WARN  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273]
>  - Unexpected exception, tries=0, remaining init limit=9999, connecting to 
> /9.152.171.12:288
> 8
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> 2017-04-25 09:09:33,142 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369]
>  - Getting a diff from the leader 0x0
> 2017-04-25 09:09:33,146 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516]
>  - Learner received NEWLEADER message
> 2017-04-25 09:09:33,207 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499]
>  - Learner received UPTODATE message
> 2017-04-25 09:09:33,220 [myid:2] - WARN  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446]
>  - Restarting Leader Election
> 2017-04-25 09:09:33,221 [myid:2] - INFO  
> [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
> 2017-04-25 09:09:33,222 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:33,222 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, 
> my id = 2, error = 
> java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.net.SocketInputStream.read(SocketInputStream.java:224)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> As far as I get that, node 2 wants to start a new leader election, but fails 
> to establish a connection to the other nodes. It tries over and over, finally 
> ending up in a timeout. Unfortunately, this doesn't give me any hint on what 
> exactly breaks up the connection between the follower node (node 2) and the 
> leader node (node 3) and why it can be re-established. 
> It might also be relevant that I'm running ZooKeeper in Docker containers, 
> using the host network option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to