Haoze Wu created ZOOKEEPER-4734:
-----------------------------------

Summary: FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
Key: ZOOKEEPER-4734
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
Project: ZooKeeper
Issue Type: Bug
Components: tests
Affects Versions: 3.6.0
Reporter: Haoze Wu

In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and restarted to test snapshot loading. During the restart, the quorum server calls ZKDatabase#loadDataBase(), which can throw an IOException when a transient disk failure occurs.

{code:java}
public long loadDataBase() throws IOException {
    // line 240: IOException thrown here
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
    initialized = true;
    return zxid;
}
{code}

In FileTxnSnapLog#restore:

{code:java}
public long restore(DataTree dt, Map<Long, Integer> sessions, PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException
    ...
}
{code}

Here is the stack trace:

{code:java}
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
at java.lang.Thread.run(Thread.java:748)
{code}

Because of this IOException, the restart fails, and so does the test.

As for a fix, we could either retry the test, as proposed in ZOOKEEPER-3157, or add a configurable retry mechanism to ZKDatabase#loadDataBase() so that it tolerates transient disk failures.
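
To make the second option concrete, here is a minimal sketch of a bounded retry wrapper around an IOException-throwing call. It is not existing ZooKeeper code: the class RetryingLoader, the interface IoCall, and the parameters maxAttempts and backoffMs are hypothetical names chosen for illustration. It also assumes that repeating the load after a failure is safe, which this report does not establish.

```java
import java.io.IOException;

/** Hypothetical helper, not part of ZooKeeper: retries an IO-bound call a bounded number of times. */
final class RetryingLoader {

    /** Stand-in for a call such as ZKDatabase#loadDataBase() that may throw IOException. */
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    private RetryingLoader() {
    }

    /**
     * Invokes the call up to maxAttempts times, sleeping backoffMs between
     * failed attempts. Rethrows the last IOException if every attempt fails.
     */
    static <T> T withRetry(IoCall<T> call, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // remember the failure in case no attempt succeeds
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // fixed backoff before the next attempt
                }
            }
        }
        throw last; // non-null here because at least one attempt ran and failed
    }
}
```

A caller could then wrap the load as RetryingLoader.withRetry(() -> zkDb.loadDataBase(), 3, 100L), where zkDb and both numeric values are placeholders. If a failed restore can leave the DataTree partly populated, the retry would presumably also need to reset that state before the next attempt.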