[ https://issues.apache.org/jira/browse/ZOOKEEPER-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dharani updated ZOOKEEPER-4878:
-------------------------------
    Description:
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When we performed a Chaos Mesh IO fault experiment using , the ZooKeeper servers did not recover.
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
    at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
    at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
    at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}
Expectation: When the IO_fault experiment is run with Chaos Mesh for 60 seconds, all ZooKeeper servers should recover by themselves without any manual intervention. Is it possible to serve partial traffic while the PV is hung?

  was:
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas.
When we performed a Chaos Mesh IO fault experiment, the ZooKeeper servers did not recover.
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
    at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
    at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
    at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}

> Zookeeper servers not running after Chaos mesh IO fault experiment
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4878
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4878
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.8.3
>            Reporter: Dharani
>            Priority: Major
>
> We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas.
> When we performed a Chaos Mesh IO fault experiment using , the ZooKeeper servers did not recover.
> {code:java}
> 2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
>     at java.base/java.io.FileOutputStream.open0(Native Method)
>     at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>     at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
>     at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> 2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
> {code}
> Expectation: When the IO_fault experiment is run with Chaos Mesh for 60 seconds, all ZooKeeper servers should recover by themselves without any manual intervention. Is it possible to serve partial traffic while the PV is hung?