[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435844#comment-15435844
 ] 

Ramnatthan Alagappan commented on ZOOKEEPER-2528:
-------------------------------------------------

Hi Patrick,
Here is the log from one server as it restarts after the crash described in 
the initial description. I have lightly anonymized the directory names and 
server IPs. 

2016-04-14 20:30:10,350 [myid:] - INFO  [main:QuorumPeerConfig@103] - Reading configuration from: /tmp/zoo2.cfg
2016-04-14 20:30:10,364 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: ip1 to address: /ip1
2016-04-14 20:30:10,364 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: ip3 to address: /ip3
2016-04-14 20:30:10,364 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: ip2 to address: /ip2
2016-04-14 20:30:10,364 [myid:] - INFO  [main:QuorumPeerConfig@331] - Defaulting to majority quorums
2016-04-14 20:30:10,367 [myid:1] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2016-04-14 20:30:10,367 [myid:1] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2016-04-14 20:30:10,367 [myid:1] - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2016-04-14 20:30:10,376 [myid:1] - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
2016-04-14 20:30:10,384 [myid:1] - INFO  [main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2182
2016-04-14 20:30:10,389 [myid:1] - INFO  [main:QuorumPeer@1019] - tickTime set to 2000
2016-04-14 20:30:10,389 [myid:1] - INFO  [main:QuorumPeer@1039] - minSessionTimeout set to -1
2016-04-14 20:30:10,389 [myid:1] - INFO  [main:QuorumPeer@1050] - maxSessionTimeout set to -1
2016-04-14 20:30:10,389 [myid:1] - INFO  [main:QuorumPeer@1065] - initLimit set to 5
2016-04-14 20:30:10,398 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot data_dir/version-2/snapshot.100000002
2016-04-14 20:30:10,404 [myid:1] - ERROR [main:QuorumPeer@557] - Unable to load database on disk
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:581)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:600)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:566)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:648)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
        at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
2016-04-14 20:30:10,406 [myid:1] - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:558)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:581)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:600)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:566)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:648)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
        at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
        ... 4 more
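
The trace bottoms out in DataInputStream.readInt, called while deserializing a 
file header, which is consistent with the server opening a zero-length log 
file. The following is a minimal standalone sketch of that condition, not 
ZooKeeper code: the class name EmptyLogRepro and the method headerReadHitsEof 
are made up for illustration, and a temporary file stands in for the log.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyLogRepro {
    // Returns true if reading a 4-byte int from the start of the file
    // reaches end-of-file before 4 bytes are available.
    static boolean headerReadHitsEof(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readInt();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // createTempFile leaves a zero-length file on disk, which mimics a log
        // that was created but never received its header.
        Path log = Files.createTempFile("log.", ".tmp");
        try {
            System.out.println("EOF on empty file: " + headerReadHitsEof(log));
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

Any file shorter than 4 bytes triggers the same EOFException on the first 
readInt call, so an empty log is enough to reproduce the exception type seen 
in the trace.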


> ZooKeeper cluster can become unavailable due to power failures
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2528
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2528
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.8
>         Environment: A normal ZooKeeper cluster of 3 nodes running on 3 Linux 
> machines. 
>            Reporter: Ramnatthan Alagappan
>            Assignee: Abraham Fine
>            Priority: Critical
>
> ZooKeeper cluster can become unavailable if power failures happen at certain 
> specific points in time. 
> Details:
> I am running a three-node ZooKeeper cluster. I perform a simple update from a 
> client machine. 
> When I update a value, ZooKeeper may create a new log file (for example, 
> when the current log is full). First, it creates the file and appends some 
> header information to the newly created log. The system call sequence looks 
> like this:
> creat(log.200000001)
> append(log.200000001, offset=0,  count=16)
> Now, if a power failure happens just after the creat of the log file but 
> before the append of the header information, the node fails on restart with 
> an EOF exception. If the same problem occurs on two or more nodes in my 
> three-node cluster, the entire cluster becomes unavailable, because a 
> majority of the servers cannot start.  
> A simultaneous power failure across multiple nodes is plausible in 
> single-data-center or single-rack deployments. 
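
One possible defensive check, sketched below, is to treat a log file that is 
shorter than its header as containing no transactions and skip it during 
recovery. This is only an illustration under stated assumptions, not 
ZooKeeper's code and not necessarily how the issue was resolved: the class 
HeaderLengthCheck and its methods are hypothetical, and the 16-byte header 
size is taken from the append call in the description above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class HeaderLengthCheck {
    // Assumed header size, from append(..., count=16) in the issue description.
    static final long HEADER_BYTES = 16;

    // A file shorter than the header cannot hold a complete header.
    static boolean hasCompleteHeader(Path logFile) throws IOException {
        return Files.size(logFile) >= HEADER_BYTES;
    }

    // Keeps only the candidate log files that are long enough to hold a header.
    static List<Path> readableLogs(List<Path> candidates) throws IOException {
        List<Path> result = new ArrayList<>();
        for (Path p : candidates) {
            if (hasCompleteHeader(p)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempFile("log.", ".empty");
        Path withHeader = Files.createTempFile("log.", ".ok");
        try {
            Files.write(withHeader, new byte[(int) HEADER_BYTES]);
            System.out.println(readableLogs(List.of(empty, withHeader)).size());
        } finally {
            Files.deleteIfExists(empty);
            Files.deleteIfExists(withHeader);
        }
    }
}
```

The check relies on the assumption that the header is the first thing written 
to a new log, so a file shorter than the header cannot contain a transaction. 
It only looks at file length, so it does not validate the header contents or 
detect corruption later in the file.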



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)