[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056947#comment-15056947
 ] 

David Lao commented on ZOOKEEPER-2104:
--------------------------------------

I'm seeing the same issue. Additional observations: when the ensemble gets into 
this state, the leader and followers appears to be replicating massive amount 
of data although the curator clients were mostly idle. The on disk foot print 
for the log and snapshot are 65MB and 500MB respectively and there are plenty 
of memory and disk space on the servers. 

> Sudden crash of all nodes in the cluster
> ----------------------------------------
>
>                 Key: ZOOKEEPER-2104
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Benjamin Jaton
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying 
> "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs 
> when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - 
> fsync-ing the write ahead log in SyncThread:1 took 11024ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN  
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE 
> /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE 
> /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring 
> unexpected exception
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>         at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>         at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,007 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,115 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to