chihsuan opened a new pull request, #10451:
URL: https://github.com/apache/ozone/pull/10451

   ## What changes were proposed in this pull request?
   
   `TestOMRatisSnapshots` takes ~575s locally (12 tests: 6 active methods x 2 
parameters since HDDS-14721 made the class parameterized). Profiling the run 
shows a large share of the time is fixed sleeps and oversized cluster setup 
rather than the code under test. This PR removes the dead time without 
weakening any assertion:
   
   1. **Remove the per-key `Thread.sleep(100)` in 
`writeKeysToIncreaseLogIndex`.** Each `createKey()` is a synchronous RPC 
producing 2 Ratis transactions, so the loop converges without sleeping; callers 
only require reaching at least the target index. The sleep was introduced 
without documented rationale in the HDDS-3741 refactor (the original HDDS-1649 
loop had no sleep). This alone was ~40s of pure sleep per parameter run.
   2. **Replace/remove three hardcoded `Thread.sleep(5000)`.** Two were 
redundant (immediately followed by 30s-polling assertions); the third now polls 
for the last written key to appear in the follower's key table, the same 
pattern already used elsewhere in this class.
   3. **Use a single datanode.** All key writes in this class use 
`ReplicationFactor.ONE`, so 3 DNs only add startup/teardown cost for each of 
the 12 cluster instances. (`setNumDatanodes` returns the base builder type, 
hence the builder chain is split.)
   4. **Fix a pre-existing race in `testInstallOldCheckpointFailure` exposed by 
(1).** The test reads `followerTermIndex` and asserts a log message containing 
that exact TermIndex, but trailing Ratis applies can advance the index between 
the test's read and `installCheckpoint()`'s internal read (observed: expected 
`i:202`, OM logged `i:203`). The old per-key sleep happened to mask this. The 
fix waits for the follower's applied index to reach the leader's (quiescence) 
before reading, instead of relying on sleep timing.
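   
   For illustration, the poll-instead-of-sleep pattern behind items 2 and 4 
can be sketched as below. This is a minimal, self-contained sketch and not the 
code in this PR: `PollingSketch`, `waitFor`, and 
`readFollowerIndexWhenQuiescent` are hypothetical names, and the follower's 
applied index is simulated with an `AtomicLong` rather than read from a real OM.
   
   ```java
   import java.util.concurrent.atomic.AtomicLong;
   import java.util.function.BooleanSupplier;

   final class PollingSketch {

     /**
      * Hypothetical helper: re-checks the condition every intervalMs and
      * returns as soon as it holds; fails once timeoutMs has elapsed.
      */
     static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
       long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
       while (!condition.getAsBoolean()) {
         if (System.nanoTime() - deadline > 0) {
           throw new IllegalStateException("Timed out after " + timeoutMs + " ms");
         }
         try {
           Thread.sleep(intervalMs);
         } catch (InterruptedException e) {
           Thread.currentThread().interrupt();
           throw new IllegalStateException("Interrupted while waiting", e);
         }
       }
     }

     /**
      * Simulates a follower that is still applying trailing transactions,
      * and reads its index only after it has caught up with the leader.
      */
     static long readFollowerIndexWhenQuiescent(long leaderIndex) {
       // Assumption for the sketch: the follower starts 3 entries behind.
       AtomicLong followerApplied = new AtomicLong(leaderIndex - 3);
       Thread applier = new Thread(() -> {
         while (followerApplied.get() < leaderIndex) {
           try {
             Thread.sleep(20); // stand-in for the time one apply takes
           } catch (InterruptedException e) {
             return;
           }
           followerApplied.incrementAndGet();
         }
       });
       applier.setDaemon(true);
       applier.start();

       // Reading here could return a stale index that later advances.
       // Polling for quiescence first makes the value read stable.
       waitFor(() -> followerApplied.get() >= leaderIndex, 10, 5_000);
       return followerApplied.get();
     }

     public static void main(String[] args) {
       System.out.println(readFollowerIndexWhenQuiescent(203));
     }
   }
   ```
   
   Unlike a fixed sleep, the wait ends as soon as the condition holds, so the 
common case costs only a few poll intervals while a slow machine still gets 
the full timeout.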
   
   Measured locally (M-series MacBook): 574.3s before, 483.9s / 525.7s on two 
full runs after (roughly 8-16% faster; the run-to-run variance is attributed 
to machine load), with all 12 tests passing.
   
   Follow-up (separate Jira, to be filed under HDDS-9000): limit the HDDS-14721 
class-level parameterization to the tests that actually exercise the checkpoint 
transfer format, which roughly halves the remaining cost.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-10310
   
   ## How was this patch tested?
   
   Full `TestOMRatisSnapshots` class run locally with both parameters:
   
   ```
   mvn -pl hadoop-ozone/integration-test test -Dtest=TestOMRatisSnapshots
   Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
   ```
   
   The full class was run repeatedly to check for flakiness (a third run is in 
progress; results will be added before marking ready for review). Per-test 
comparison against the 
baseline shows the affected tests improving consistently, e.g. 
`testInstallSnapshotWithClientWrite` 50-55s -> 33-39s, 
`testInstallSnapshotWithClientRead` 44s -> 26-37s, 
`testInstallOldCheckpointFailure` 33-37s -> 23-28s. `checkstyle`, `rat`, and 
`author` checks pass on the module.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

