[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dharani updated ZOOKEEPER-4878:
-------------------------------
    Description: 
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When we 
performed a Chaos Mesh IO fault experiment using , the ZooKeeper servers did not 
recover.
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
        at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
        at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
        at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
        at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
        at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}
Expectation: When an IO_fault experiment using Chaos Mesh is performed for 60 
sec, all the ZooKeeper servers should recover by themselves without any manual 
intervention. Is it possible to serve partial traffic while the PV is hung? 

  was:
We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When we 
performed a Chaos Mesh IO fault experiment, the ZooKeeper servers did not recover.
{code:java}
2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
        at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
        at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
        at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
        at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
        at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
{code}
 


> Zookeeper servers not running after Chaos mesh IO fault experiment
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4878
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4878
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.8.3
>            Reporter: Dharani
>            Priority: Major
>
> We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When 
> we performed a Chaos Mesh IO fault experiment using , the ZooKeeper servers did 
> not recover.
> {code:java}
> 2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
>         at java.base/java.io.FileOutputStream.open0(Native Method)
>         at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>         at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>         at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>         at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
>         at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
>         at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
>         at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
>         at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> 2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
> {code}
> Expectation: When an IO_fault experiment using Chaos Mesh is performed for 60 
> sec, all the ZooKeeper servers should recover by themselves without any manual 
> intervention. Is it possible to serve partial traffic while the PV is hung? 
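
A note on reading the trace: java.io.FileOutputStream reports any failure to 
open a file as FileNotFoundException and puts the operating system's error text 
in parentheses after the path. The "(Input/output error)" suffix is therefore 
the informative part here, not the exception type. The sketch below is only an 
illustration and is not ZooKeeper code: the class name, method name, and 
directory name are hypothetical, and it uses a missing parent directory to 
provoke an open failure because a real I/O error cannot be produced portably.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical demo class; not part of ZooKeeper.
class OpenFailureDemo {

    /**
     * Tries to open the target for writing.
     * Returns the exception message if the open (or close) fails, or null on success.
     */
    static String describeOpenFailure(File target) {
        try (FileOutputStream out = new FileOutputStream(target)) {
            return null; // opened (and then closed) without error
        } catch (FileNotFoundException e) {
            // Thrown for any open failure; the message is "<path> (<OS error text>)".
            return e.getMessage();
        } catch (IOException e) {
            // Only reachable if closing the stream fails.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: this directory does not exist, so the open fails.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "zk-4878-demo-missing-dir/snapshot.0");
        System.out.println(describeOpenFailure(target));
    }
}
```

The printed message has the same shape as the one in the log, with the 
platform's own error text in place of "Input/output error".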



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
