[ https://issues.apache.org/jira/browse/ZOOKEEPER-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829395#comment-17829395 ]
Xin Chen edited comment on ZOOKEEPER-3975 at 3/21/24 6:56 AM: -------------------------------------------------------------- I also encountered this issue in a three-node cluster. It occurred when ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were able to come online successfully, with zk2 being the leader. However, zk1 encountered this error while loading a particular transaction log log.300a8108e0: {code:java} 2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load database on disk java.io.IOException: Unreasonable length = 11807835 at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127) at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92) at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) 2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally java.lang.RuntimeException: Unable to run quorum server at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) Caused by: java.io.IOException: Unreasonable length = 11807835 at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127) at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92) at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651) ... 4 more {code} I tried to individually parse and read the transaction log log.300a8108e0 using “zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while loading a specific transaction record in the middle of the log ({*}approximately the 38448th record{*}). This record is not the last record in the transaction log, which {*}consists of over 90,000 records in total{*}. Since I couldn't parse it, I couldn't determine what caused this particular transaction record to become corrupted. I'm puzzled as to why the corruption occurred in the middle of the transaction log instead of at the end. Normally, if there were abnormal events such as power outage or disk full, the last line of the transaction log could be affected. So, I'm wondering what circumstances could lead to corruption in the middle of the transaction log, resulting in this error. The ZooKeeper process is unable to recover automatically and keeps restarting due to this IOException. Indeed, this is a very serious issue. Who has experience with this situation? was (Author: JIRAUSER298666): I also encountered this issue in a three-node cluster. It occurred when ZooKeeper was unexpectedly offline and then came back online. zk0 and zk2 were able to come online successfully, with zk2 being the leader. However, zk1 encountered this error while loading a particular transaction log log.300a8108e0: {code:java} 2024-03-18 15:28:52,000 [myid:2] - ERROR [main:QuorumPeer@698] - Unable to load database on disk java.io.IOException: Unreasonable length = 11807835 at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127) at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92) at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) 2024-03-18 15:28:52,005 [myid:2] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally java.lang.RuntimeException: Unable to run quorum server at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) Caused by: java.io.IOException: Unreasonable length = 11807835 at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127) at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92) at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:208) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:632) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:651) ... 4 more {code} I tried to individually parse and read the transaction log log.300a8108e0 using “zkTxnLogToolkit.sh -d ./log.300a8108e0 and found that the error occurred while loading a specific transaction record in the middle of the log ({*}approximately the 38448th record{*}). This record is not the last record in the transaction log, which {*}consists of over 90,000 records in total{*}. Since I couldn't parse it, I couldn't determine what caused this particular transaction record to become corrupted. I'm puzzled as to why the corruption occurred in the middle of the transaction log instead of at the end. Normally, if there were abnormal events such as power outage or disk full, the last line of the transaction log could be affected. So, I'm wondering what circumstances could lead to corruption in the middle of the transaction log, resulting in this error. The ZooKeeper process is unable to recover automatically and keeps restarting due to this IOException. Indeed, this is a very serious issue. How can I get the help? > Zookeeper crashes: Unable to load database on disk java.io.IOException: > Unreasonable length > ------------------------------------------------------------------------------------------- > > Key: ZOOKEEPER-3975 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975 > Project: ZooKeeper > Issue Type: Bug > Components: jute > Affects Versions: 3.6.2 > Environment: Debian 10 x64 > openjdk version "11.0.8" 2020-07-14 > OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1) > OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, > sharing) > Reporter: Diego Lucas Jiménez > Priority: Critical > > After running for a while, the entire cluster (3 zookeeper) crash suddenly, > all of them logging: > > {code:java} > 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] > - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at > org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > 2020-10-16 10:37:00,475 [myid:2] - ERROR > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - > Unable to load database on disk > java.io.IOException: Unreasonable length = 5089607 > at > org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166) > at > org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127) > at > org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159) > at > org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303) > at > org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285) > at > org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093) > at > org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249) > at > org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868) > at > org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428){code} > Apparently the "corrupted" file appears in all the servers, so no solution > such as "removing version-2 on the faulty server and letting replicate from a > healthy one" :(. > The entire cluster goes down, we have downtime, every-single-day since we > upgraded from 3.4.9. > -- This message was sent by Atlassian Jira (v8.20.10#820010)