[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093180#comment-16093180
 ] 

prashantkumar commented on ZOOKEEPER-2766:
------------------------------------------

I am also seeing exactly same issue 
i had logged https://issues.apache.org/jira/browse/ZOOKEEPER-2848

> Quorum fails with java.io.EOFException
> --------------------------------------
>
>                 Key: ZOOKEEPER-2766
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.5.3
>         Environment: CentOS-7, Docker version 17.03.1-ce
>            Reporter: Patrick Kleindienst
>         Attachments: logs-node1.txt, logs-node2.txt, logs-node3.txt
>
>
> When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing 
> the following behavior:
> Two nodes (let's say node 2 and 3) out of the three start their own quorum, 
> and finally one of them is elected the new leader (node 3), while the other 
> one becomes the follower (node 2). Node 1 seems to be able to establish a 
> connection to node 3 (elected leader) in my case, but this seems to fail for 
> node 2. 
> Node 1 shows the following in its logs:
>  2017-04-25 09:24:02,806 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055]
>  - LOOKING
> 2017-04-25 09:24:02,808 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894]
>  - New election. My id =  1, proposed zxid=0x0
> 2017-04-25 09:24:02,811 [myid:1] - INFO  
> [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:24:02,817 [myid:1] - WARN  
> [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at 
> election address /9.152.171.98:3888
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-04-25 09:24:02,822 [myid:1] - INFO  
> [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, 
> so dropping the connection: (3, 1)
> 2017-04-25 09:24:03,025 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457]
>  - Cannot open channel to 2 at election address /9.152.171.98:3888
> However, that's not all, since the quorum consisting of node 2 and 3 does not 
> work properly. The nodes' logs tell that leader election between these two 
> works fine.
> Here's what node 3 (leader) says:
>  2017-04-25 09:09:33,842 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:33,844 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:33,851 [myid:3] - INFO  
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2017-04-25 09:09:34,051 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
>  - Unregister MBean 
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
> 2017-04-25 09:09:34,052 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143]
>  - LEADING
> 2017-04-25 09:09:34,055 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] 
> - TCP NoDelay set to: true
> 2017-04-25 09:09:34,055 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] 
> - zookeeper.leader.maxConcurrentSnapshots = 10
> 2017-04-25 09:09:34,056 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] 
> - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
> And here's the output node 2 (follower) provides:
> 2017-04-25 09:09:31,875 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message 
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:32,077 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
>  - Unregister MBean 
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
> ica.2,name2=LeaderElection]
> 2017-04-25 09:09:32,077 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131]
>  - FOLLOWING
> 2017-04-25 09:09:32,082 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] 
> - TCP NoDelay set to: true
> So far, so good. But seconds later the connection between node 2 and 3 seems 
> to get lost, causing leader node 3 to report an EOFExeption. If I understand 
> the logs correctly, node 2 (follower) properly closes the connection (sending 
> "Goodbye"), whilst node 3 says that the socket is still open. 
> 2017-04-25 09:09:34,190 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] 
> - LEADING - LEADER ELECTION TOOK - 138 MS
> 2017-04-25 09:09:34,197 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320]
>  - Snapshotting: 0x0 to /data/version-2/snapshot.0
> 2017-04-25 09:09:35,076 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : 
> info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
> 2017-04-25 09:09:35,113 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with 
> Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 
> peerLastZxid=0x
> 0
> 2017-04-25 09:09:35,114 [myid:3] - INFO  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF 
> zxid=0x0 for peer sid: 2
> 2017-04-25 09:09:35,133 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258]
>  - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and 
> setting last processed zxid: 0x100000000
> 2017-04-25 09:09:35,169 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255]
>  - Configuring CommitProcessor with 2 worker threads.
> 2017-04-25 09:09:35,179 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64]
>  - Using checkIntervalMs=60000 maxPerMinute=10000
> 2017-04-25 09:09:35,196 [myid:3] - WARN  
> [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, 
> my id = 3, error = 
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> 2017-04-25 09:09:35,196 [myid:3] - WARN  
> [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
> 2017-04-25 09:09:35,197 [myid:3] - WARN  
> [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:35,197 [myid:3] - WARN  
> [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread  
> id 2 my id = 3
> 2017-04-25 09:09:35,204 [myid:3] - ERROR 
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected 
> exception causing shutdown while sock still open
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> 2017-04-25 09:09:35,204 [myid:3] - WARN  
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE 
> /9.152.171.98:51328 ********
> 2017-04-25 09:09:37,181 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] 
> - Shutting down
> 2017-04-25 09:09:37,182 [myid:3] - INFO  
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] 
> - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers 
> synced, only synced with sids: [ [3] ]
>         at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)
> Unfortunately, node 2 does not provide any additional information on what 
> exactly is going on. After leader election, the only thing it reports is this:
> 2017-04-25 09:09:32,091 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68]
>  - FOLLOWING - LEADER ELECTION TOOK - 13 MS
> 2017-04-25 09:09:32,094 [myid:2] - WARN  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273]
>  - Unexpected exception, tries=0, remaining init limit=9999, connecting to 
> /9.152.171.12:288
> 8
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at 
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> 2017-04-25 09:09:33,142 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369]
>  - Getting a diff from the leader 0x0
> 2017-04-25 09:09:33,146 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516]
>  - Learner received NEWLEADER message
> 2017-04-25 09:09:33,207 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499]
>  - Learner received UPTODATE message
> 2017-04-25 09:09:33,220 [myid:2] - WARN  
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446]
>  - Restarting Leader Election
> 2017-04-25 09:09:33,221 [myid:2] - INFO  
> [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
> 2017-04-25 09:09:33,222 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:33,222 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, 
> my id = 2, error = 
> java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.net.SocketInputStream.read(SocketInputStream.java:224)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> As far as I get that, node 2 wants to start a new leader election, but fails 
> to establish a connection to the other nodes. It tries over and over, finally 
> ending up in a timeout. Unfortunately, this doesn't give me any hint on what 
> exactly breaks up the connection between the follower node (node 2) and the 
> leader node (node 3) and why it can be re-established. 
> It might also be relevant that I'm running ZooKeeper in Docker containers, 
> using the host network option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to