xBis7 opened a new pull request, #5917:
URL: https://github.com/apache/ozone/pull/5917

   ## What changes were proposed in this pull request?
   
   The test goes through these steps
   * Start with an inactive follower
   * Write keys
   * Start the inactive follower, which downloads a ratis snapshot
   * Write more keys
   * The follower now downloads an incremental snapshot
   * Delete some sst files from the follower's candidate dir
   * The follower tries to install an incremental snapshot, which fails
   * The follower installs a new snapshot
   * Check the metrics to verify the above steps 
   
   The metrics part was commented out by 
https://github.com/apache/ozone/pull/5673.
   
   The metrics depend on downloading the 3rd snapshot. During the failures, the 
corruption in the candidate dir isn’t picked up and the Ratis GrpcLogAppender 
actually returns SUCCESS and the snapshot installation isn’t repeated. 
   
   ### Making the test consistently fail
   
   The issue is with the code deleting the sst files. The steps go as following
   * Get a list of sst files
   * Shuffle the list
   * Get the first 3 sst files
   * Delete them
   
   The initial list that we get is always like this
   
   ```
   [000054.sst, 000057.sst, 000062.sst, 000063.sst, 000061.sst, 000058.sst, 
000053.sst, 000060.sst, 000055.sst, 000056.sst]
   ```
   
   During the failures, this is what the sst file list could potentially look 
like after deletes
   
   ```
   [000054.sst, 000057.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 
000055.sst]
   ```
   
   ```
   [000054.sst, 000061.sst, 000058.sst, 000053.sst, 000060.sst, 000055.sst, 
000056.sst
   ```
   
   In both cases 5-6 sst files in consecutive order, were left untouched. Due 
to the randomness, some times we end up leaving too many consecutive files 
untouched and as a result, the candidate dir isn't considered corrupted. During 
the corruption we expect to get a result `SNAPSHOT_UNAVAILABLE`.
   
   If we remove the shuffle and just delete the 3rd, the 4th and the last 
element of the list, we end up with 5 consecutive ssts untouched. [Check this 
commit.](https://github.com/xBis7/ozone/commit/5c60f9fedef570dd96a401de453103b99bf4ebbf).
 With that change, the test fails every time, 100/100 runs.
   
   https://github.com/xBis7/ozone/actions/runs/7400858782/job/20136301493
   
   ### Fix
   
   If we remove the randomness and delete every second file in the list, the 
test always passes. 
   
   I've been running it on repeat using the `flaky-test-check` ci, 10x10.
   
   https://github.com/xBis7/ozone/actions/runs/7401653879
   
   https://github.com/xBis7/ozone/actions/runs/7401950911 (1 failure)
   
   https://github.com/xBis7/ozone/actions/runs/7402888697
   
   https://github.com/xBis7/ozone/actions/runs/7403418317
   
   https://github.com/xBis7/ozone/actions/runs/7403423825 (1 failure)
   
   These 2 failures, both had to do with a timeout in writing the keys, which 
is appearing so rarely that it was probably a memory issue and it could have 
something to do with the fact that I was running just the method and not the 
entire class.
   
   The metrics issue is no longer there.
   
   ```
   at 
org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
   at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:330)
   at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:831)
   at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:788)
   at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:350)
   at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:580)
   at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
   at 
org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:225)
   at 
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeysToIncreaseLogIndex(TestOMRatisSnapshots.java:1066)
   at 
org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallIncrementalSnapshotWithFailure(TestOMRatisSnapshots.java:630)
   ```
   
   Check this file for the entire thread dump: 
   
[org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt](https://github.com/apache/ozone/files/13828129/org.apache.hadoop.ozone.om.TestOMRatisSnapshots.txt)
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-10004
   
   ## How was this patch tested?
   
   This patch is fixing a flaky test and it was tested using the new 
`flaky-test-check` ci.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to