xBis7 opened a new pull request, #5917: URL: https://github.com/apache/ozone/pull/5917
## What changes were proposed in this pull request?

The test goes through these steps:
* Start with an inactive follower
* Write keys
* Start the inactive follower, which downloads a Ratis snapshot
* Write more keys
* The follower now downloads an incremental snapshot
* Delete some sst files from the follower's candidate dir
* The follower tries to install an incremental snapshot, which fails
* The follower installs a new snapshot
* Check the metrics to verify the above steps

The metrics part was commented out by https://github.com/apache/ozone/pull/5673.

The metrics depend on downloading the 3rd snapshot. During the failures, the corruption in the candidate dir isn't picked up, the Ratis GrpcLogAppender actually returns SUCCESS, and the snapshot installation isn't repeated.

### Making the test consistently fail

The issue is with the code that deletes the sst files. The steps are as follows:
* Get a list of sst files
* Shuffle the list
* Get the first 3 sst files
* Delete them

The initial list that we get is always like this
```
[000054.sst, 000057.sst, 000062.sst, 000063.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst]
```

During the failures, this is what the sst file list could look like after the deletes
```
[000054.sst, 000057.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst]
```
```
[000054.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 000056.sst]
```

In both cases, 5-6 consecutive sst files were left untouched. Due to the randomness, we sometimes end up leaving too many consecutive files untouched and, as a result, the candidate dir isn't considered corrupted. When the candidate dir is corrupted, we expect to get the result `SNAPSHOT_UNAVAILABLE`.

If we remove the shuffle and just delete the 3rd, the 4th and the last element of the list, we end up with 5 consecutive ssts untouched. [Check this commit.](https://github.com/xBis7/ozone/commit/5c60f9fedef570dd96a401de453103b99bf4ebbf)
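The run lengths quoted above can be checked with a small sketch. It takes the initial list and the two post-deletion lists from this description and measures the longest stretch of adjacent list positions that survived the deletes. The class and method names are hypothetical, and reading "consecutive" as "adjacent in list order" is an assumption, not something taken from the Ozone or Ratis code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Ozone code base. */
final class SstRunLength {

  /**
   * Returns the length of the longest run of elements that are adjacent in
   * {@code original} and still present in {@code remaining}.
   */
  static int longestUntouchedRun(List<String> original, Set<String> remaining) {
    int longest = 0;
    int current = 0;
    for (String name : original) {
      // A deleted file breaks the current run of untouched files.
      current = remaining.contains(name) ? current + 1 : 0;
      longest = Math.max(longest, current);
    }
    return longest;
  }

  public static void main(String[] args) {
    // Initial list, copied from the description above.
    List<String> original = Arrays.asList(
        "000054.sst", "000057.sst", "000062.sst", "000063.sst", "000061.sst",
        "000058.sst", "000053.sst", "000060.sst", "000055.sst", "000056.sst");

    // The two post-deletion lists, copied from the description above.
    Set<String> firstCase = new HashSet<>(Arrays.asList(
        "000054.sst", "000057.sst", "000061.sst", "000058.sst",
        "000053.sst", "000060.sst", "000055.sst"));
    Set<String> secondCase = new HashSet<>(Arrays.asList(
        "000054.sst", "000061.sst", "000058.sst", "000053.sst",
        "000060.sst", "000055.sst", "000056.sst"));

    System.out.println(longestUntouchedRun(original, firstCase));  // prints 5
    System.out.println(longestUntouchedRun(original, secondCase)); // prints 6
  }
}
```

The first post-deletion list is also exactly what remains after deleting the 3rd, the 4th and the last element of the initial list, which is consistent with the 5 untouched files mentioned for the deterministic variant.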
With the shuffle removed and those three fixed deletions in place, the test fails every time (100/100 runs).

https://github.com/xBis7/ozone/actions/runs/7400858782/job/20136301493

### Fix

If we remove the randomness and delete every second file in the list, the test always passes.

I've been running it on repeat using the `flaky-test-check` CI, 10x10.

* https://github.com/xBis7/ozone/actions/runs/7401653879
* https://github.com/xBis7/ozone/actions/runs/7401950911 (1 failure)
* https://github.com/xBis7/ozone/actions/runs/7402888697
* https://github.com/xBis7/ozone/actions/runs/7403418317
* https://github.com/xBis7/ozone/actions/runs/7403423825 (1 failure)

Both failures were caused by a timeout while writing the keys. This happens so rarely that it was probably a memory issue, and it may be related to the fact that I was running only the method and not the entire class. The metrics issue is no longer there.

```
at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:330)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:831)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:788)
at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:350)
at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:580)
at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:225)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeysToIncreaseLogIndex(TestOMRatisSnapshots.java:1066)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallIncrementalSnapshotWithFailure(TestOMRatisSnapshots.java:630)
```

Check this file for the entire thread dump: [org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt](https://github.com/apache/ozone/files/13828129/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt)

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10004

## How was this patch tested?

This patch fixes a flaky test and was tested using the new `flaky-test-check` CI.
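
For illustration only, the deterministic deletion described under the Fix heading (every second file in the list) could look roughly like the sketch below. It is not the code from this pull request: the class name, the method name, the sorting of the file list, and the choice to start at the first element are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch; not the helper used by TestOMRatisSnapshots. */
final class DeleteEverySecondSst {

  /**
   * Deletes every second .sst file (positions 0, 2, 4, ...) found directly in
   * {@code dir} and returns the deleted paths. The files are sorted first so
   * that the outcome does not depend on directory listing order (assumption).
   */
  static List<Path> deleteEverySecondSst(Path dir) throws IOException {
    List<Path> ssts;
    try (Stream<Path> files = Files.list(dir)) {
      ssts = files
          .filter(p -> p.getFileName().toString().endsWith(".sst"))
          .sorted()
          .collect(Collectors.toList());
    }
    List<Path> deleted = new ArrayList<>();
    // No shuffle: the same positions are removed on every run.
    for (int i = 0; i < ssts.size(); i += 2) {
      Files.delete(ssts.get(i));
      deleted.add(ssts.get(i));
    }
    return deleted;
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the follower's candidate dir, filled with empty files.
    Path dir = Files.createTempDirectory("candidate-dir-sketch");
    for (int i = 53; i <= 62; i++) {
      Files.createFile(dir.resolve(String.format("%06d.sst", i)));
    }
    List<Path> deleted = deleteEverySecondSst(dir);
    System.out.println("deleted " + deleted.size() + " of 10 files");
  }
}
```

With this pattern no two adjacent files in the list survive together, which is the opposite of the 5-6 untouched neighbours seen in the failing runs.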
