Patrick Kleindienst created ZOOKEEPER-2766:
----------------------------------------------

             Summary: Leader dies with java.io.EOFException
                 Key: ZOOKEEPER-2766
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, quorum
    Affects Versions: 3.5.3
         Environment: CentOS-7, Docker version 17.03.1-ce
            Reporter: Patrick Kleindienst


When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing the 
following behavior:
Two nodes (let's say node 2 and 3) out of the three start their own quorum, and 
finally one of them is elected the new leader (node 3), while the other one 
becomes the follower (node 2). Node 1 seems to be able to establish a 
connection to node 3 (elected leader) in my case, but this seems to fail for 
node 2. 
Node 1 shows the following in its logs:

 2017-04-25 09:24:02,806 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055]
 - LOOKING
2017-04-25 09:24:02,808 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894]
 - New election. My id =  1, proposed zxid=0x0
2017-04-25 09:24:02,811 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message 
format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 
1 (n.sid), 0x0 (
n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:24:02,817 [myid:1] - WARN  
[WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at 
election address /9.152.171.98:3888
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
        at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
        at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
        at java.lang.Thread.run(Thread.java:745)
2017-04-25 09:24:02,822 [myid:1] - INFO  
[WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, 
so dropping the connection: (3, 1)
2017-04-25 09:24:03,025 [myid:1] - WARN  
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457]
 - Cannot open channel to 2 at election address /9.152.171.98:3888

However, that's not all, since the quorum consisting of node 2 and 3 does not 
work properly. The nodes' logs tell that leader election between these two 
works fine.
Here's what node 3 (leader) says:

 2017-04-25 09:09:33,842 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 
3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:33,844 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 
2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:33,851 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message 
format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 
2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:34,051 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
2017-04-25 09:09:34,052 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143]
 - LEADING
2017-04-25 09:09:34,055 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] - 
TCP NoDelay set to: true
2017-04-25 09:09:34,055 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] - 
zookeeper.leader.maxConcurrentSnapshots = 10
2017-04-25 09:09:34,056 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] - 
zookeeper.leader.maxConcurrentSnapshotTimeout = 5


And here's the output node 2 (follower) provides:

2017-04-25 09:09:31,875 [myid:2] - INFO  
[WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message 
format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 
2 (n.sid), 0x0 (
n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:32,077 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
ica.2,name2=LeaderElection]
2017-04-25 09:09:32,077 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131]
 - FOLLOWING
2017-04-25 09:09:32,082 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] - 
TCP NoDelay set to: true

So far, so good. But seconds later the connection between node 2 and 3 seems to 
get lost, causing leader node 3 to report an EOFExeption. If I understand the 
logs correctly, node 2 (follower) properly closes the connection (sending 
"Goodbye"), whilst node 3 says that the socket is still open. 

2017-04-25 09:09:34,190 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] - 
LEADING - LEADER ELECTION TOOK - 138 MS
2017-04-25 09:09:34,197 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320]
 - Snapshotting: 0x0 to /data/version-2/snapshot.0
2017-04-25 09:09:35,076 [myid:3] - INFO  
[LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : 
info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
2017-04-25 09:09:35,113 [myid:3] - INFO  
[LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with 
Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 
peerLastZxid=0x
0
2017-04-25 09:09:35,114 [myid:3] - INFO  
[LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF zxid=0x0 
for peer sid: 2
2017-04-25 09:09:35,133 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258] 
- Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and setting 
last processed zxid: 0x100000000
2017-04-25 09:09:35,169 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255]
 - Configuring CommitProcessor with 2 worker threads.
2017-04-25 09:09:35,179 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64]
 - Using checkIntervalMs=60000 maxPerMinute=10000
2017-04-25 09:09:35,196 [myid:3] - WARN  
[RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, my 
id = 3, error = 
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
2017-04-25 09:09:35,196 [myid:3] - WARN  
[RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
2017-04-25 09:09:35,197 [myid:3] - WARN  
[SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for 
message on queue
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at 
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
2017-04-25 09:09:35,197 [myid:3] - WARN  
[SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread  id 
2 my id = 3
2017-04-25 09:09:35,204 [myid:3] - ERROR 
[LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected exception 
causing shutdown while sock still open
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
2017-04-25 09:09:35,204 [myid:3] - WARN  
[LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE 
/9.152.171.98:51328 ********
2017-04-25 09:09:37,181 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] - 
Shutting down
2017-04-25 09:09:37,182 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] - 
Shutdown called
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, 
only synced with sids: [ [3] ]
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)

Unfortunately, node 2 does not provide any additional information on what 
exactly is going on. After leader election, the only thing it reports is this:

2017-04-25 09:09:32,091 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68] 
- FOLLOWING - LEADER ELECTION TOOK - 13 MS
2017-04-25 09:09:32,094 [myid:2] - WARN  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] 
- Unexpected exception, tries=0, remaining init limit=9999, connecting to 
/9.152.171.12:288
8
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at 
org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
        at 
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
2017-04-25 09:09:33,142 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369] 
- Getting a diff from the leader 0x0
2017-04-25 09:09:33,146 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516] 
- Learner received NEWLEADER message
2017-04-25 09:09:33,207 [myid:2] - INFO  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499] 
- Learner received UPTODATE message
2017-04-25 09:09:33,220 [myid:2] - WARN  
[QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446]
 - Restarting Leader Election
2017-04-25 09:09:33,221 [myid:2] - INFO  
[/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
2017-04-25 09:09:33,222 [myid:2] - WARN  
[SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for 
message on queue
java.lang.InterruptedException
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at 
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
2017-04-25 09:09:33,222 [myid:2] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, my 
id = 2, error = 
java.net.SocketException: Socket closed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.net.SocketInputStream.read(SocketInputStream.java:224)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)

As far as I get that, node 2 wants to start a new leader election, but fails to 
establish a connection to the other nodes. It tries over and over, finally 
ending up in a timeout. Unfortunately, this doesn't give me any hint on what 
exactly breaks up the connection between the follower node (node 2) and the 
leader node (node 3) and why it can be re-established. 

It might also be relevant that I'm running ZooKeeper in Docker containers, 
using the host network option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to