Robert Joseph Evans created ZOOKEEPER-2106:
----------------------------------------------
Summary: Error when reading from leader causes JVM to hang
Key: ZOOKEEPER-2106
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2106
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.4.5
Reporter: Robert Joseph Evans
Priority: Critical
I looked through the existing JIRA issues for something like this, and the
closest I found was ZOOKEEPER-2104. It looks very similar, but I don't know
whether it really is the same problem. We had a 5-node ensemble serving a large
Storm cluster. Several of the nodes hit the following error at the same time:
{code}
WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 2, my id = 4, error =
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
WARN [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
WARN [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
WARN [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following the leader
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:189)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
{code}
After that, all of the client connections are shut down:
{code}
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client ...
{code}
but the JVM itself does not shut down:
{code}
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting down
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - Shutting down
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down
INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown of request processor complete
INFO [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop!
WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client /... (no session established for client)
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting down
INFO [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited!
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
{code}
After that, every connection to that node is accepted and then closed with
"ZooKeeperServer not running". The node seems to stay in this state
indefinitely, until the process is manually restarted. After the restart it
recovers.
We have seen this happen on multiple servers at the same time, leaving the
entire ensemble unusable.
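A stuck node like this can probably be spotted from the outside, since the client port still accepts connections. The following is only a minimal sketch. It assumes the node still answers the `stat` four-letter command, and that while the ZooKeeperServer is not running the answer contains the "not currently serving requests" line. Both assumptions should be checked against the deployed version. The class `ZkServingProbe` and its methods are hypothetical names used for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkServingProbe {
    // Assumption: text returned to four-letter commands while the
    // ZooKeeperServer is not running. Verify against your version.
    static final String NOT_SERVING =
            "This ZooKeeper instance is not currently serving requests";

    // Sends a four-letter command (for example "stat") to the client port
    // and returns everything the server writes before closing the socket.
    static String fourLetterWord(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    // Treats an empty reply or the "not serving" text as an unhealthy node.
    static boolean isServing(String statResponse) {
        return statResponse != null
                && !statResponse.isEmpty()
                && !statResponse.contains(NOT_SERVING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ZkServingProbe <host> <clientPort>");
            return;
        }
        String reply = fourLetterWord(args[0], Integer.parseInt(args[1]), "stat", 5000);
        System.out.println(isServing(reply) ? "serving" : "NOT serving");
    }
}
```

Run periodically against each ensemble member, a check like this could flag a node for restart instead of waiting for someone to notice that clients can no longer connect. It only detects the symptom and does nothing about the underlying hang.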