[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829395#comment-17829395
 ] 

Xin Chen edited comment on ZOOKEEPER-3975 at 3/21/24 6:56 AM:
--------------------------------------------------------------

I also encountered this issue in a three-node cluster. It occurred when 
ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were 
able to come online successfully, with zk2 being the leader. However, zk1 
encountered this error while loading a particular transaction log 
log.300a8108e0:
{code:java}
2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load 
database on disk
java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected 
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        ... 4 more
{code}
I tried to individually parse and read the transaction log log.300a8108e0 using 
“zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while 
loading a specific transaction record in the middle of the log 
({*}approximately the 38448th record{*}). This record is not the last record in 
the transaction log, which {*}consists of over 90,000 records in total{*}. 
Since I couldn't parse it, I couldn't determine what caused this particular 
transaction record to become corrupted. I'm puzzled as to why the corruption 
occurred in the middle of the transaction log instead of at the end. Normally, 
if there were abnormal events such as power outage or disk full, the last line 
of the transaction log could be affected. So, I'm wondering what circumstances 
could lead to corruption in the middle of the transaction log, resulting in 
this error. The ZooKeeper process is unable to recover automatically and keeps 
restarting due to this IOException. Indeed, this is a very serious issue.

Who has experience with this situation?


was (Author: JIRAUSER298666):
I also encountered this issue in a three-node cluster. It occurred when 
ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were 
able to come online successfully, with zk2 being the leader. However, zk1 
encountered this error while loading a particular transaction log 
log.300a8108e0:
{code:java}
2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load 
database on disk
java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected 
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        ... 4 more
{code}
I tried to individually parse and read the transaction log log.300a8108e0 using 
“zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while 
loading a specific transaction record in the middle of the log 
({*}approximately the 38448th record{*}). This record is not the last record in 
the transaction log, which {*}consists of over 90,000 records in total{*}. 
Since I couldn't parse it, I couldn't determine what caused this particular 
transaction record to become corrupted. I'm puzzled as to why the corruption 
occurred in the middle of the transaction log instead of at the end. Normally, 
if there were abnormal events such as power outage or disk full, the last line 
of the transaction log could be affected. So, I'm wondering what circumstances 
could lead to corruption in the middle of the transaction log, resulting in 
this error. The ZooKeeper process is unable to recover automatically and keeps 
restarting due to this IOException. Indeed, this is a very serious issue.

How can I get the help?

> Zookeeper crashes: Unable to load database on disk java.io.IOException: 
> Unreasonable length
> -------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: jute
>    Affects Versions: 3.6.2
>         Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, 
> sharing)
>            Reporter: Diego Lucas Jiménez
>            Priority: Critical
>
> After running for a while, the entire cluster (3 zookeeper) crash suddenly, 
> all of them logging:
>  
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] 
> - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at 
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) 
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at 
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at 
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR 
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - 
> Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
>         at 
> org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>         at 
> org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>         at 
> org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
>         at 
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428){code}
> Apparently the "corrupted" file appears in all the servers, so no solution 
> such as "removing version-2 on the faulty server and letting replicate from a 
> healthy one" :(.
> The entire cluster goes down, we have downtime, every-single-day since we 
> upgraded from 3.4.9. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to