[
https://issues.apache.org/jira/browse/ZOOKEEPER-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885709#comment-17885709
]
Kezhu Wang commented on ZOOKEEPER-3975:
---------------------------------------
{quote}
I'm puzzled as to why the corruption occurred in the middle of the transaction
log instead of at the end. Normally, if there were abnormal events such as
power outage or disk full, the last line of the transaction log could be
affected. So, I'm wondering what circumstances could lead to corruption in the
middle of the transaction log, resulting in this error.
{quote}
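For context on where that error comes from: {{FileTxnLog$FileTxnIterator}} reads a 4-byte record length, validates it against {{jute.maxbuffer}} in {{BinaryInputArchive.checkLength}}, and treats a zero length as the end of the log. A minimal sketch of that read path, with names and the ~1 MiB default assumed rather than copied from the jute sources:
{code:java}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Simplified sketch of how a txn-log record is read back during restore.
// Method and constant names are illustrative; only the behaviour mirrors
// Util.readTxnBytes / BinaryInputArchive.checkLength.
class TxnLogReadSketch {

    // jute.maxbuffer defaults to roughly 1 MiB; record lengths above it are rejected.
    static final int MAX_BUFFER = Integer.getInteger("jute.maxbuffer", 0xfffff);

    static byte[] readTxnBytes(DataInputStream in) throws IOException {
        final int len;
        try {
            len = in.readInt();       // 4-byte record length
        } catch (EOFException e) {
            return null;              // physical end of the file
        }
        if (len == 0) {
            return null;              // zero length: pre-allocated padding, treated as end of log
        }
        if (len < 0 || len > MAX_BUFFER) {
            // A garbage length read mid-file (e.g. 5089607) ends up here and aborts the restore.
            throw new IOException("Unreasonable length = " + len);
        }
        byte[] buf = new byte[len];
        in.readFully(buf);            // the real code then verifies a checksum as well
        return buf;
    }
}
{code}
So a single garbage length word (here 5089607) anywhere before the real end of the log is enough to abort the whole restore, even if only a few bytes are actually damaged.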
Does {{zookeeper.preAllocSize}} contribute to this?
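If it does, the mechanism I have in mind is the padding in {{FilePadding}}: the log file is grown ahead of the write position in {{zookeeper.preAllocSize}} chunks (64 MB by default), and the reader depends on that tail reading back as zeros so it can stop at the first zero-length record. A rough sketch of the padding step, as an illustration of the idea rather than the actual implementation:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

// Rough sketch of txn-log pre-allocation; the 64 MB default is real, but the
// RandomAccessFile-based mechanics are an assumption for illustration only.
class PreAllocSketch {

    static final long PRE_ALLOC_SIZE =
            Long.getLong("zookeeper.preAllocSize", 64L * 1024 * 1024);

    // Called before appending a txn: grow the file in large chunks so appends do not
    // extend it on every write. On typical filesystems the new tail reads back as zeros.
    static void padFile(RandomAccessFile log, long currentPos) throws IOException {
        if (currentPos + 4096 >= log.length()) {
            long padded = ((currentPos / PRE_ALLOC_SIZE) + 1) * PRE_ALLOC_SIZE;
            log.setLength(padded);
        }
    }
}
{code}
If anything other than zeros ends up in that pre-allocated region (an interrupted write, a filesystem that does not zero the extended tail, silent corruption), the iterator sketched above reads it as a bogus length somewhere in the middle of the file, which would look exactly like this report.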
> Zookeeper crashes: Unable to load database on disk java.io.IOException:
> Unreasonable length
> -------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3975
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
> Project: ZooKeeper
> Issue Type: Bug
> Components: jute
> Affects Versions: 3.6.2
> Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode,
> sharing)
> Reporter: Diego Lucas Jiménez
> Priority: Critical
>
> After running for a while, the entire cluster (3 ZooKeeper servers) crashes suddenly,
> all of them logging:
>
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] - Close of session 0x0
> java.io.IOException: ZooKeeperServer not running
>     at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
>     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
>     at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
>     at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428)
> {code}
> Apparently the "corrupted" file appears on all the servers, so a workaround such as
> "remove version-2 on the faulty server and let it replicate from a healthy one" is
> not an option :(.
> The entire cluster goes down and we have had downtime every single day since we
> upgraded from 3.4.9.
>