Robert Joseph Evans created ZOOKEEPER-2106:
----------------------------------------------

             Summary: Error when reading from leader causes JVM to hang
                 Key: ZOOKEEPER-2106
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2106
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.5
            Reporter: Robert Joseph Evans
            Priority: Critical


I tried looking through existing JIRA for something like this, but the closest 
I came was ZOOKEEPER-2104.  It looks very similar, but I don't know if it 
really is the same thing.  Essentially we had a 5 node ensemble for a large 
storm cluster.  For a few of the nodes at the same time they get an error that 
looks like.

{code}
WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 
2, my id = 4, error = 
java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:392)
      at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
WARN  [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while 
waiting for message on queue
java.lang.InterruptedException
     at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
      at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
      at 
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
      at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
      at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
WARN  [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving 
thread
WARN  [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following 
the leader
java.net.SocketException: Connection reset
     at java.net.SocketInputStream.read(SocketInputStream.java:189)
     at java.net.SocketInputStream.read(SocketInputStream.java:121)
     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
     at java.io.DataInputStream.readInt(DataInputStream.java:387)
     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
     at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
     at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
     at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
      at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
{code}

After that all of the connections are shut down
{code}
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket 
connection for client ...
{code}

but it does not manage to have the JVM shut down

{code}
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting 
down
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - 
Shutting down
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down
INFO  [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - 
FollowerRequestProcessor exited loop!
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown 
of request processor complete
INFO  [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop!
WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - 
Exception causing close of session 0x0 due to java.io.IOException: 
ZooKeeperServer not running
INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed 
socket connection for client /... (no session established for client)
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting 
down
INFO  [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited!
INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
{code}

after that all connections to that node initiate, and then are shut down with 
ZooKeeperServer not running.  It seems to stay in this state indefinitely until 
the process is manually restarted.  After that it recovers.

We have seen this happen on multiple servers at the same time resulting in the 
entire ensemble being unusable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to