[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829395#comment-17829395
 ] 

Xin Chen commented on ZOOKEEPER-3975:
-------------------------------------

I also encountered this issue in a three-node cluster. It occurred when 
ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were 
able to come online successfully, with zk2 being the leader. However, zk1 
encountered this error while loading a particular transaction log 
log.300a8108e0:
{code:java}
2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load 
database on disk
java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected 
exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: Unreasonable length = 11807835
        at 
org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
        at 
org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
        at 
org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208)
        at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651)
        ... 4 more
{code}


I tried to individually parse and read the transaction log log.300a8108e0 using 
“zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while 
loading a specific transaction record in the middle of the log (*approximately 
the 38448th record*). This record is not the last record in the transaction 
log, which *consists of over 90,000 records in total*. Since I couldn't parse 
it, I couldn't determine what caused this particular transaction record to 
become corrupted. I'm puzzled as to why the corruption occurred in the middle 
of the transaction log instead of at the end. Normally, if there were abnormal 
events such as power outage or disk full, the last line of the transaction log 
could be affected. So, I'm wondering what circumstances could lead to 
corruption in the middle of the transaction log, resulting in this error. The 
ZooKeeper process is unable to recover automatically and keeps restarting due 
to this IOException. Indeed, this is a very serious issue.  

How can I  get the help? 
The attachments are the logs of ZooKeeper.  [^zookeeper.log.2] 


> Zookeeper crashes: Unable to load database on disk java.io.IOException: 
> Unreasonable length
> -------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: jute
>    Affects Versions: 3.6.2
>         Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, 
> sharing)
>            Reporter: Diego Lucas Jiménez
>            Priority: Critical
>
> After running for a while, the entire cluster (3 zookeeper) crash suddenly, 
> all of them logging:
>  
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] 
> - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at 
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) 
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at 
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at 
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR 
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - 
> Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
>         at 
> org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>         at 
> org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>         at 
> org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
>         at 
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428){code}
> Apparently the "corrupted" file appears in all the servers, so no solution 
> such as "removing version-2 on the faulty server and letting replicate from a 
> healthy one" :(.
> The entire cluster goes down, we have downtime, every-single-day since we 
> upgraded from 3.4.9. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to