Haoze Wu created ZOOKEEPER-4734:
-----------------------------------

             Summary: FuzzySnapshotRelatedTest becomes flaky when transient 
disk failure appears
                 Key: ZOOKEEPER-4734
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
             Project: ZooKeeper
          Issue Type: Bug
          Components: tests
    Affects Versions: 3.6.0
            Reporter: Haoze Wu


In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and 
restarted to test snapshot loading. However, while the quorum server is 
restarting, it calls ZKDatabase#loadDataBase(), which can throw an 
IOException because of a transient disk failure.
{code:java}
public long loadDataBase() throws IOException {
    // line 240: IOException thrown here
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
    initialized = true;
    return zxid;
} {code}
In FileTxnSnapLog#restore, the exception comes from the snapshot deserialization call:

 
{code:java}
public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException thrown here
...
}{code}
Here is the stacktrace: 
{code:java}
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
        at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
        at java.lang.Thread.run(Thread.java:748) {code}
Because of this IOException, the restart fails and so does the test. 

As for the fix, we could either retry the test, as proposed in 
ZOOKEEPER-3157, or add a configurable retry mechanism to 
ZKDatabase#loadDataBase() to tolerate transient disk failures. 
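
One possible shape for the second option is sketched below. It is only an illustration, not a patch: the class RetryingLoad, the interface IoCall, and the parameters maxAttempts and backoffMs are hypothetical names that do not exist in ZooKeeper. The sketch also assumes that every IOException is worth retrying and that re-running the load after a failed attempt is safe, neither of which has been verified here.

```java
import java.io.IOException;

// Hypothetical helper, not part of ZooKeeper: retries an I/O call a bounded
// number of times before giving up.
public class RetryingLoad {

    /** A call that may fail with IOException, such as a database load. */
    @FunctionalInterface
    public interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the call up to maxAttempts times, sleeping backoffMs between
     * attempts. Rethrows the last IOException if every attempt fails.
     */
    public static <T> T withRetries(IoCall<T> call, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // remember the failure, retry if attempts remain
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        throw last; // non-null here because maxAttempts >= 1
    }
}
```

A caller could wrap the existing call as RetryingLoad.withRetries(() -> zkDb.loadDataBase(), 3, 100L), where zkDb, the attempt count, and the delay are placeholders that would come from configuration in a real change.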

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
