[
https://issues.apache.org/jira/browse/ZOOKEEPER-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829395#comment-17829395
]
Xin Chen edited comment on ZOOKEEPER-3975 at 3/21/24 6:40 AM:
--------------------------------------------------------------
I also encountered this issue in a three-node cluster. It occurred when
ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were
able to come online successfully, with zk2 being the leader. However, zk1
encountered this error while loading a particular transaction log
log.300a8108e0:
{code:java}
2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load
database on disk
java.io.IOException: Unreasonable length = 11807835
at
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
at
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
at
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: Unreasonable length = 11807835
at
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
at
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
at
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
... 4 more
{code}
I tried to individually parse and read the transaction log log.300a8108e0 using
“zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while
loading a specific transaction record in the middle of the log (*approximately
the 38448th record*). This record is not the last record in the transaction
log, which *consists of over 90,000 records in total*. Since I couldn't parse
it, I couldn't determine what caused this particular transaction record to
become corrupted. I'm puzzled as to why the corruption occurred in the middle
of the transaction log instead of at the end. Normally, if there were abnormal
events such as power outage or disk full, the last line of the transaction log
could be affected. So, I'm wondering what circumstances could lead to
corruption in the middle of the transaction log, resulting in this error. The
ZooKeeper process is unable to recover automatically and keeps restarting due
to this IOException. Indeed, this is a very serious issue.
How can I get the help?
The attachments are the logs of ZooKeeper.
was (Author: JIRAUSER298666):
I also encountered this issue in a three-node cluster. It occurred when
ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were
able to come online successfully, with zk2 being the leader. However, zk1
encountered this error while loading a particular transaction log
log.300a8108e0:
{code:java}
2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load
database on disk
java.io.IOException: Unreasonable length = 11807835
at
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
at
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
at
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
at
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: Unreasonable length = 11807835
at
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
at
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
at
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
at
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
at
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
at
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
... 4 more
{code}
I tried to individually parse and read the transaction log log.300a8108e0 using
“zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while
loading a specific transaction record in the middle of the log (*approximately
the 38448th record*). This record is not the last record in the transaction
log, which *consists of over 90,000 records in total*. Since I couldn't parse
it, I couldn't determine what caused this particular transaction record to
become corrupted. I'm puzzled as to why the corruption occurred in the middle
of the transaction log instead of at the end. Normally, if there were abnormal
events such as power outage or disk full, the last line of the transaction log
could be affected. So, I'm wondering what circumstances could lead to
corruption in the middle of the transaction log, resulting in this error. The
ZooKeeper process is unable to recover automatically and keeps restarting due
to this IOException. Indeed, this is a very serious issue.
How can I get the help?
The attachments are the logs of ZooKeeper. [^zookeeper.log.2]
> Zookeeper crashes: Unable to load database on disk java.io.IOException:
> Unreasonable length
> -------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3975
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
> Project: ZooKeeper
> Issue Type: Bug
> Components: jute
> Affects Versions: 3.6.2
> Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode,
> sharing)
> Reporter: Diego Lucas Jiménez
> Priority: Critical
>
> After running for a while, the entire cluster (3 zookeeper) crash suddenly,
> all of them logging:
>
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373]
> - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] -
> Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
> at
> org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
> at
> org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
> at
> org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
> at
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
> at
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
> at
> org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
> at
> org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428){code}
> Apparently the "corrupted" file appears in all the servers, so no solution
> such as "removing version-2 on the faulty server and letting replicate from a
> healthy one" :(.
> The entire cluster goes down, we have downtime, every-single-day since we
> upgraded from 3.4.9.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)