[
https://issues.apache.org/jira/browse/ZOOKEEPER-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275529#comment-14275529
]
Robert Joseph Evans commented on ZOOKEEPER-2106:
------------------------------------------------
Actually, digging further, it looks like the ensemble would have recovered
eventually, but the errors were happening so often that we thought it was not
recovering. I'll dig deeper to find the root cause of the connection drops and
open a new JIRA if it turns out to be related to ZK. Sorry for filing a JIRA
without seeing everything that was going on.
> Error when reading from leader causes JVM to hang
> -------------------------------------------------
>
> Key: ZOOKEEPER-2106
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2106
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.5
> Reporter: Robert Joseph Evans
> Priority: Critical
>
> I tried searching the existing JIRAs for something like this; the closest I
> found was ZOOKEEPER-2104. It looks very similar, but I don't know whether it
> is really the same thing. Essentially, we had a 5-node ensemble for a large
> Storm cluster. A few of the nodes hit, at the same time, an error that looks
> like:
> {code}
> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 2, my id = 4, error =
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
> WARN [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>     at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
> WARN [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
> WARN [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following the leader
> java.net.SocketException: Connection reset
>     at java.net.SocketInputStream.read(SocketInputStream.java:189)
>     at java.net.SocketInputStream.read(SocketInputStream.java:121)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> {code}
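> Both traces share the same read pattern. The following is a minimal
> self-contained sketch (my assumption about the shape of the code, not
> ZooKeeper's actual source) of a blocking length-prefixed read loop like the
> ones in RecvWorker and Learner.readPacket:
> {code}
> import java.io.DataInputStream;
> import java.io.EOFException;
> import java.io.IOException;
> import java.net.Socket;
>
> public class QuorumReadSketch {
>     static void readLoop(Socket sock) throws IOException {
>         DataInputStream din = new DataInputStream(sock.getInputStream());
>         while (true) {
>             int length;
>             try {
>                 // Blocks until 4 bytes arrive. If the peer closes the socket
>                 // cleanly, the stream ends and readInt() throws EOFException
>                 // (the RecvWorker "Connection broken" trace). If the peer
>                 // resets the connection instead, the underlying read throws
>                 // SocketException: Connection reset (the follower's
>                 // Learner.readPacket trace).
>                 length = din.readInt();
>             } catch (EOFException eof) {
>                 return; // remote side is gone; tear down this worker
>             }
>             byte[] payload = new byte[length];
>             din.readFully(payload); // read the message body
>             // ... hand the payload to election / follower processing ...
>         }
>     }
> }
> {code}
> Either exception just means the remote end dropped the link; the interesting
> part is what the peer does afterwards.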
> After that, all of the client connections are shut down:
> {code}
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client ...
> {code}
> but the JVM itself never manages to shut down:
> {code}
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - Shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down
> INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown of request processor complete
> INFO [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop!
> WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client /... (no session established for client)
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting down
> INFO [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited!
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
> {code}
> After that, every connection to that node is accepted and then immediately
> closed with "ZooKeeperServer not running". The node seems to stay in this
> state indefinitely until the process is manually restarted, after which it
> recovers. The last log line shows the peer dropping back into leader
> election, roughly as in the sketch below.
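> This is a simplified, self-contained sketch (an assumption about the shape
> of QuorumPeer.run() in 3.4.x, not the actual code) of the state machine
> behind those log lines:
> {code}
> public class QuorumPeerSketch {
>     enum ServerState { LOOKING, FOLLOWING, LEADING }
>
>     private volatile boolean running = true;
>     private ServerState state = ServerState.LOOKING;
>
>     public void run() {
>         while (running) {
>             switch (state) {
>             case LOOKING:
>                 // Run leader election; the winning vote tells this peer
>                 // whether it should lead or follow.
>                 state = lookForLeader();
>                 break;
>             case FOLLOWING:
>                 try {
>                     followLeader(); // blocks until the leader link breaks
>                 } catch (Exception e) {
>                     // "Exception when following the leader" in the log
>                 } finally {
>                     // "shutdown called" / "shutdown Follower", then back to
>                     // LOOKING -- the last line of the log above
>                     state = ServerState.LOOKING;
>                 }
>                 break;
>             case LEADING:
>                 break; // elided
>             }
>         }
>     }
>
>     private ServerState lookForLeader() { return ServerState.FOLLOWING; } // stub
>     private void followLeader() throws Exception { } // stub
> }
> {code}
> If election keeps succeeding and the leader link keeps dropping, the peer can
> cycle through these states for a long time, which matches what we saw.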
> We have seen this happen on multiple servers at the same time, leaving the
> entire ensemble unusable.
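> For anyone hitting the same thing, a cheap way to detect a node stuck in
> this state is to probe it with the "stat" four-letter command on the client
> port; a node that is up but not serving answers with a "not currently
> serving" message (my reading of the 3.4 NIOServerCnxn behavior, so treat the
> exact reply text as an assumption). A minimal probe:
> {code}
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> import java.nio.charset.StandardCharsets;
>
> public class ZkStatProbe {
>     public static void main(String[] args) throws Exception {
>         // Hypothetical defaults; pass host/port on the command line.
>         String host = args.length > 0 ? args[0] : "localhost";
>         int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
>         try (Socket sock = new Socket()) {
>             sock.connect(new InetSocketAddress(host, port), 3000);
>             OutputStream out = sock.getOutputStream();
>             out.write("stat".getBytes(StandardCharsets.US_ASCII));
>             out.flush();
>             // The server writes its reply and then closes the connection.
>             InputStream in = sock.getInputStream();
>             StringBuilder reply = new StringBuilder();
>             byte[] buf = new byte[4096];
>             int n;
>             while ((n = in.read(buf)) > 0) {
>                 reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
>             }
>             System.out.println(reply);
>             // A node wedged like the one above reports it is not serving;
>             // exit nonzero so a supervisor can restart the process.
>             if (reply.toString().contains("not currently serving")) {
>                 System.exit(1);
>             }
>         }
>     }
> }
> {code}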
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)