[
https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395387#comment-15395387
]
Daniel Freudenberger commented on ZOOKEEPER-2104:
-------------------------------------------------
[~fpj] the referenced snapshot is the latest one and all snapshots have around
the same size. The log files look pretty much all the same to me. But maybe
your eyes will catch sth. that looked good to me. You can download the log
files for all nodes if you like
(https://s3-eu-west-1.amazonaws.com/files.rebuy-cdn.de/logs.tgz). The
transaction log is written to the same device. This is sth. we can't really
change in our current setup.
> Sudden crash of all nodes in the cluster
> ----------------------------------------
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying
> "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs
> when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] -
> fsync-ing the write ahead log in SyncThread:1 took 11024ms which will
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:23,492 [myid:1] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:24,060 [myid:1] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:22,801 [myid:3] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:22,886 [myid:3] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN
> [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE
> /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE
> /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring
> unexpected exception
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> at
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
> at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:23,007 [myid:2] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
> 2015-01-04 16:18:23,115 [myid:2] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not
> running
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)