[
https://issues.apache.org/jira/browse/ZOOKEEPER-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435844#comment-15435844
]
Ramnatthan Alagappan commented on ZOOKEEPER-2528:
-------------------------------------------------
Hi Patrick,
Here is the log of one server when it restarts from the crash I have mentioned
in the initial description. I have slightly anonymized the actual directory
names and actual server ips.
2016-04-14 20:30:10,350 [myid:] - INFO [main:QuorumPeerConfig@103] - Reading
configuration from: /tmp/zoo2.cfg
2016-04-14 20:30:10,364 [myid:] - INFO [main:QuorumPeer$QuorumServer@149] -
Resolved hostname: ip1 to address: /ip1
2016-04-14 20:30:10,364 [myid:] - INFO [main:QuorumPeer$QuorumServer@149] -
Resolved hostname: ip3 to address: /ip3
2016-04-14 20:30:10,364 [myid:] - INFO [main:QuorumPeer$QuorumServer@149] -
Resolved hostname: ip2 to address: /ip2
2016-04-14 20:30:10,364 [myid:] - INFO [main:QuorumPeerConfig@331] -
Defaulting to majority quorums
2016-04-14 20:30:10,367 [myid:1] - INFO [main:DatadirCleanupManager@78] -
autopurge.snapRetainCount set to 3
2016-04-14 20:30:10,367 [myid:1] - INFO [main:DatadirCleanupManager@79] -
autopurge.purgeInterval set to 0
2016-04-14 20:30:10,367 [myid:1] - INFO [main:DatadirCleanupManager@101] -
Purge task is not scheduled.
2016-04-14 20:30:10,376 [myid:1] - INFO [main:QuorumPeerMain@127] - Starting
quorum peer
2016-04-14 20:30:10,384 [myid:1] - INFO [main:NIOServerCnxnFactory@89] -
binding to port 0.0.0.0/0.0.0.0:2182
2016-04-14 20:30:10,389 [myid:1] - INFO [main:QuorumPeer@1019] - tickTime set
to 2000
2016-04-14 20:30:10,389 [myid:1] - INFO [main:QuorumPeer@1039] -
minSessionTimeout set to -1
2016-04-14 20:30:10,389 [myid:1] - INFO [main:QuorumPeer@1050] -
maxSessionTimeout set to -1
2016-04-14 20:30:10,389 [myid:1] - INFO [main:QuorumPeer@1065] - initLimit set
to 5
2016-04-14 20:30:10,398 [myid:1] - INFO [main:FileSnap@83] - Reading snapshot
data_dir/version-2/snapshot.100000002
2016-04-14 20:30:10,404 [myid:1] - ERROR [main:QuorumPeer@557] - Unable to load
database on disk
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:581)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:600)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:566)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:648)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
at
org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
2016-04-14 20:30:10,406 [myid:1] - ERROR [main:QuorumPeerMain@89] - Unexpected
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:558)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:581)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:600)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:566)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:648)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
at
org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
... 4 more
> ZooKeeper cluster can become unavailable due to power failures
> --------------------------------------------------------------
>
> Key: ZOOKEEPER-2528
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2528
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.8
> Environment: A normal ZooKeeper cluster of 3 nodes running on 3 Linux
> machines.
> Reporter: Ramnatthan Alagappan
> Assignee: Abraham Fine
> Priority: Critical
>
> ZooKeeper cluster can become unavailable if power failures happen at certain
> specific points in time.
> Details:
> I am running a three-node ZooKeeper cluster. I perform a simple update from a
> client machine.
> When I try to update a value, ZooKeeper creates a new log file (for example,
> when the current log is fully utilized). First, it creates the file and
> appends some header information to the newly created log. The system call
> sequence looks like below:
> creat(log.200000001)
> append(log.200000001, offset=0, count=16)
> Now, if a power failure happens just after the creat of the log file but
> before the append of the header information, the node simply crashes with an
> EOF exception. If the same problem occurs at two or more nodes in my
> three-node cluster, the entire cluster becomes unavailable as the majority of
> servers have crashed because of the above problem.
> A power failure at the same time across multiple nodes may be possible in
> single data center or single rack deployment scenarios.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)