[ https://issues.apache.org/jira/browse/ZOOKEEPER-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234811#comment-17234811 ]
Jason commented on ZOOKEEPER-2553:
----------------------------------
Seeing the same or a similar problem within Docker Swarm, just by doing a
simple stop/start of the Stack.
This is ZooKeeper 3.5.8:
[2020-11-18 16:38:59,163] INFO Started @2367ms (org.eclipse.jetty.server.Server)
[2020-11-18 16:38:59,164] INFO Started AdminServer on address 0.0.0.0, port 8080 and command URL /commands (org.apache.zookeeper.server.admin.JettyAdminServer)
[2020-11-18 16:38:59,191] INFO Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory (org.apache.zookeeper.server.ServerCnxnFactory)
[2020-11-18 16:38:59,194] INFO Configuring NIO connection handler with 10s sessionless connection timeout, 1 selector thread(s), 8 worker threads, and 64 kB direct buffers. (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-11-18 16:38:59,196] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-11-18 16:38:59,232] INFO zookeeper.snapshotSizeFactor = 0.33 (org.apache.zookeeper.server.ZKDatabase)
[2020-11-18 16:38:59,236] INFO Reading snapshot /var/lib/zookeeper/data/version-2/snapshot.0 (org.apache.zookeeper.server.persistence.FileSnap)
[2020-11-18 16:38:59,273] ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
java.io.IOException: Unreasonable length = 1702047600
    at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:146)
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:111)
    at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:205)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:684)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:294)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:229)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:253)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:290)
    at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:450)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:764)
    at org.apache.zookeeper.server.ServerCnxnFactory.startup(ServerCnxnFactory.java:98)
    at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:144)
    at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
    at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Is there any workaround or mitigation?
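
A side note on the number in the exception, offered as a sketch rather than a diagnosis: as far as I can tell, jute reads each length prefix as a 4-byte big-endian int and rejects it when it is negative or above the configured buffer limit. Rendering the reported values as raw bytes shows what was actually sitting in the log at that offset. The class and method names below (LengthBytes, asHex, asText) are made up for illustration and are not part of ZooKeeper.

```java
import java.nio.ByteBuffer;

final class LengthBytes {
    /** Renders the four bytes of a length prefix as 8 hex digits. */
    static String asHex(int length) {
        return String.format("%08x", length);
    }

    /** Renders the same four bytes as text, with '.' for non-printable bytes. */
    static String asText(int length) {
        // ByteBuffer is big-endian by default, matching DataInput.readInt().
        byte[] raw = ByteBuffer.allocate(4).putInt(length).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Value from the 3.5.8 trace in this comment.
        System.out.println(asHex(1702047600) + " " + asText(1702047600));
        // Value from the original 3.4.8 report.
        System.out.println(asHex(-295704495) + " " + asText(-295704495));
    }
}
```

The value from this comment decodes to the printable bytes "es/p", which suggests (but does not prove) that a fragment of text ended up where a transaction length was expected. The value from the original report does not decode to readable text.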
> ZooKeeper cluster unavailable due to corrupted log file during power failures
> -- java.io.IOException: Unreasonable length
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2553
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2553
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.8
> Environment: Normal ZooKeeper cluster with 3 nodes running Linux
> Reporter: Ramnatthan Alagappan
> Priority: Major
>
> I am running a three node ZooKeeper cluster.
> When a new log file is created by ZooKeeper, I see the following sequence of
> system calls:
> 1. creat(new_log)
> 2. write(new_log, count=16) // This is a log header, I believe.
> 3. truncate(new_log, from 16 bytes to 16 KBytes) // I have configured the log
> size to be 16K.
> When the above sequence of operations completes, it is reasonable to expect
> the newly created log file to contain the header (16 bytes) followed by
> zeros until the end of the log.
> But if a crash occurs (due to a power failure) while the truncate system
> call is in progress, it is possible for the log to contain garbage data when
> the system restarts from the crash. Note that if the crash occurs just after
> the truncate system call completes, then there is no problem. Basically, the
> truncate needs to be atomically persisted for ZooKeeper to recover from
> crashes correctly or (more realistically) the recovery code needs to deal
> with the case of expecting garbage in a newly created log.
> As mentioned, if a crash occurs during the truncate system call, then
> ZooKeeper will fail to start with the following exception. Here is the stack
> trace:
> java.io.IOException: Unreasonable length = -295704495
>     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
>     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
>     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:233)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:652)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> [myid:1] - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
> java.lang.RuntimeException: Unable to run quorum server
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:558)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> Caused by: java.io.IOException: Unreasonable length = -295704495
>     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
>     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
>     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:233)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:652)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:552)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:527)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:354)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:510)
>     ... 4 more
> Moreover, it is possible for two nodes of a 3-node ZooKeeper cluster to
> reach the same state. In that case, both will fail to start up, leaving the
> cluster without a quorum and therefore unavailable.
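
To make the create/write/truncate sequence from the description concrete, here is a minimal Java sketch that reproduces it and checks the post-condition the reporter expects (header followed by zeros). It is not ZooKeeper's own code: the class PreallocatedLogSketch, its methods, and the all-zero placeholder header are assumptions for illustration, and the 16-byte and 16 KB sizes are taken from the report. Java's RandomAccessFile.setLength leaves the contents of the extended region formally undefined, although on typical Linux filesystems it reads back as zeros after a clean run.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

final class PreallocatedLogSketch {
    static final int HEADER_BYTES = 16;      // header size from the report
    static final int LOG_BYTES = 16 * 1024;  // configured log size from the report

    /** Mirrors the reported sequence: create, write a header, extend the file. */
    static void createLog(Path log) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(log.toFile(), "rw")) {
            file.write(new byte[HEADER_BYTES]); // placeholder, not the real header format
            file.setLength(LOG_BYTES);          // the "truncate" step that extends the file
            file.getFD().sync();                // flush data and metadata to the device
        }
    }

    /** Returns true if every byte after the header is zero. */
    static boolean tailIsZeroFilled(Path log) throws IOException {
        byte[] all = Files.readAllBytes(log);
        for (int i = HEADER_BYTES; i < all.length; i++) {
            if (all[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("log-sketch", ".bin");
        try {
            createLog(log);
            System.out.println("size=" + Files.size(log)
                    + " zeroTail=" + tailIsZeroFilled(log));
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

A check like tailIsZeroFilled only detects the condition described in the report. It says nothing about how recovery should treat a non-zero tail, which is the open question in this issue.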