[ https://issues.apache.org/jira/browse/ZOOKEEPER-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dharani updated ZOOKEEPER-4878:
-------------------------------
    Description:
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When we performed a Chaos Mesh IO fault experiment using [^IO_Fault.yaml], the ZooKeeper servers did not recover.

ZooKeeper config file: [^zoo.cfg]
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
    at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
    at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
    at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}
Expectation: When the IO_fault experiment is performed with Chaos Mesh for 60 sec (storage pause time), all the ZooKeeper servers should recover by themselves, without any manual intervention, within 10 times the storage pause time.
Is it possible to have partial traffic while the PV is hung?

  was:
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When we performed a Chaos Mesh IO fault experiment using [^IO_Fault.yaml], the ZooKeeper servers did not recover.

ZooKeeper config file: [^zoo.cfg]
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
    at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
    at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
    at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}
Expectation: When the IO_fault experiment is performed with Chaos Mesh for 60 sec (storage pause time), all the ZooKeeper servers should recover by themselves, without any manual intervention, after 60 sec or within 10 times the storage pause time.
Is it possible to have partial traffic while the PV is hung?


> Zookeeper servers not running after Chaos mesh IO fault experiment
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4878
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4878
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.8.3
>            Reporter: Dharani
>            Priority: Major
>         Attachments: IO_Fault.yaml, zoo.cfg, zookeeper_logs.zip
>
>
> We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When
> we performed a Chaos Mesh IO fault experiment using [^IO_Fault.yaml], the
> ZooKeeper servers did not recover.
> ZooKeeper config file: [^zoo.cfg]
> {code:java}
> 2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
>     at java.base/java.io.FileOutputStream.open0(Native Method)
>     at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>     at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
>     at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> 2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
> {code}
> Expectation: When the IO_fault experiment is performed with Chaos Mesh for 60
> sec (storage pause time), all the ZooKeeper servers should recover by
> themselves, without any manual intervention, within 10 times the storage
> pause time.
> Is it possible to have partial traffic while the PV is hung?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)