Young Xu created ZOOKEEPER-4624:
-----------------------------------

             Summary: ZooKeeper service cannot be restarted after IO injection 
uses up the filesystem fds.
                 Key: ZOOKEEPER-4624
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4624
             Project: ZooKeeper
          Issue Type: Bug
         Environment: environment: *{color:#FF0000}K8S{color}*

deployment: *{color:#FF0000}statefulset replicas 3{color}*

zookeeper version: *{color:#FF0000}3.8.0{color}*
            Reporter: Young Xu


We are running a chaos test with the following scenario:
ZooKeeper pods are deployed on three nodes. We use {color:#FF0000}*IO 
injection*{color} to exhaust the file descriptors on one node (testing one pod), 
so that all filesystem operations return "Too many files". After a period of 
time, the ZooKeeper service stops running. We then stopped the injection. When I 
manually start the process again, ZooKeeper reports an error.
{code:java}
2022-10-19 02:03:07,876 [myid:3] - INFO  [main:o.a.z.s.q.QuorumPeer@2549] - QuorumPeer communication is not secured! (SASL auth disabled)
2022-10-19 02:03:07,876 [myid:3] - INFO  [main:o.a.z.s.q.QuorumPeer@2574] - quorum.cnxn.threads.size set to 20
2022-10-19 02:03:07,877 [myid:3] - INFO  [main:o.a.z.s.p.FileSnap@85] - Reading snapshot /home/edge/middleware/zookeeper/data/data/version-2/snapshot.1409ce9ac7
2022-10-19 02:03:07,883 [myid:3] - INFO  [main:o.a.z.s.DataTree@1705] - The digest in the snapshot has digest version of 2, with zxid as 0x1409ce9acc, and digest value as 81604125765
2022-10-19 02:03:11,662 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeer@1200] - Unable to load database on disk
java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2022-10-19 02:03:11,663 [myid:3] - INFO  [main:o.a.z.m.p.PrometheusMetricsProvider@570] - Shutdown executor service with timeout 1000
2022-10-19 02:03:11,739 [myid:3] - INFO  [main:o.e.j.s.AbstractConnector@383] - Stopped ServerConnector@5b03b9fe{HTTP/1.1, (http/1.1)}{zookeeper-default-2.zookeeper.default.svc.cluster.local:8080}
2022-10-19 02:03:11,742 [myid:3] - INFO  [main:o.e.j.s.h.ContextHandler@1159] - Stopped o.e.j.s.ServletContextHandler@17bffc17{/,null,STOPPED}
2022-10-19 02:03:11,746 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1201)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
Caused by: java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    ... 4 common frames omitted
2022-10-19 02:03:11,747 [myid:3] - INFO  [main:o.a.z.a.ZKAuditProvider@42] - ZooKeeper audit is disabled.
2022-10-19 02:03:11,749 [myid:3] - ERROR [main:o.a.z.u.ServiceUtils@48] - Exiting JVM with code 1
{code}
I know that deleting the data directory fixes this and gets the service up and 
running again, but I don't know why the file is corrupted.
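
The EOFException above is thrown in FileHeader.deserialize while reading the first int of a transaction log, which suggests (but does not prove) that at least one log.* file under version-2 is shorter than its header. The following is a minimal sketch for locating such files before deleting anything. It assumes the header is 16 bytes (magic int, version int, dbid long) and that transaction logs are named log.<zxid>; the class name TxnLogHeaderCheck and the directory argument are hypothetical and not part of ZooKeeper.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TxnLogHeaderCheck {
    // Assumption: header = magic (int) + version (int) + dbid (long) = 16 bytes.
    static final long HEADER_BYTES = 16;

    /** Returns the log.* files in dir that are too short to hold a full header. */
    static List<Path> findTruncatedLogs(Path dir) throws IOException {
        List<Path> truncated = new ArrayList<>();
        // The glob only matches transaction logs, not snapshot.* files.
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(dir, "log.*")) {
            for (Path log : logs) {
                if (Files.isRegularFile(log) && Files.size(log) < HEADER_BYTES) {
                    truncated.add(log);
                }
            }
        }
        return truncated;
    }

    public static void main(String[] args) throws IOException {
        // Pass the version-2 directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findTruncatedLogs(dir)) {
            System.out.println("too short for a header: " + p + " (" + Files.size(p) + " bytes)");
        }
    }
}
```

Run against the version-2 directory from the log above, this only reports candidates; whether removing a single short file is enough to restart the server, instead of deleting the whole data directory, has not been verified here.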



