[ https://issues.apache.org/jira/browse/ZOOKEEPER-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dharani updated ZOOKEEPER-4878:
-------------------------------
    Attachment: IO_Fault.yaml

> Zookeeper servers not running after Chaos mesh IO fault experiment
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4878
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4878
>             Project: ZooKeeper
>          Issue Type: Bug
>   Affects Versions: 3.8.3
>            Reporter: Dharani
>            Priority: Major
>         Attachments: IO_Fault.yaml, zoo.cfg
>
>
> We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas.
> When we performed a Chaos Mesh IO fault experiment, the ZooKeeper servers
> did not recover.
> {code:java}
> 2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
>     at java.base/java.io.FileOutputStream.open0(Native Method)
>     at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>     at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
>     at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> 2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
> {code}
> Expectation: when the Chaos Mesh IO fault experiment is run for 60
> seconds, all ZooKeeper servers should recover on their own without any
> manual intervention. Is it possible to keep serving partial traffic while
> the PV is hung?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)