chihsuan opened a new pull request, #10451: URL: https://github.com/apache/ozone/pull/10451
## What changes were proposed in this pull request?

`TestOMRatisSnapshots` takes ~575s locally (12 tests: 6 active methods x 2 parameters since HDDS-14721 made the class parameterized). Profiling the run shows a large share of the time is fixed sleeps and oversized cluster setup rather than the code under test. This PR removes the dead time without weakening any assertion:

1. **Remove the per-key `Thread.sleep(100)` in `writeKeysToIncreaseLogIndex`.** Each `createKey()` is a synchronous RPC producing 2 Ratis transactions, so the loop converges without sleeping; callers only require reaching at least the target index. The sleep was introduced without documented rationale in the HDDS-3741 refactor (the original HDDS-1649 loop had no sleep). This alone was ~40s of pure sleep per parameter run.
2. **Replace/remove three hardcoded `Thread.sleep(5000)`.** Two were redundant (immediately followed by 30s-polling assertions); the third now polls for the last written key to appear in the follower's key table, the same pattern already used elsewhere in this class.
3. **Use a single datanode.** All key writes in this class use `ReplicationFactor.ONE`, so 3 DNs only add startup/teardown cost for each of the 12 cluster instances. (`setNumDatanodes` returns the base builder type, hence the builder chain is split.)
4. **Fix a pre-existing race in `testInstallOldCheckpointFailure` exposed by (1).** The test reads `followerTermIndex` and asserts a log message containing that exact TermIndex, but trailing Ratis applies can advance the index between the test's read and `installCheckpoint()`'s internal read (observed: expected `i:202`, OM logged `i:203`). The old per-key sleep happened to mask this. The fix waits for the follower's applied index to reach the leader's (quiescence) before reading, instead of relying on sleep timing.

Measured locally (M-series MacBook): 574.3s before, 483.9s / 525.7s on two full runs after (run-to-run variance from machine load), all 12 tests passing.
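For readers unfamiliar with the pattern, the polling in (2) and the quiescence wait in (4) amount to roughly the following. This is a self-contained sketch under stated assumptions, not the code in this PR: `waitFor`, `readAfterCatchUp`, and the `AtomicLong` counters are hypothetical stand-ins for the test's polling utility and the leader/follower last-applied indices.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

class QuiescenceWaitSketch {

  /**
   * Hypothetical polling helper: re-checks the condition every intervalMs
   * and gives up with a TimeoutException after timeoutMs.
   */
  static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  /**
   * Waits until the follower's applied index has reached the leader's,
   * and only then reads the value the test will assert on.
   */
  static long readAfterCatchUp(AtomicLong followerApplied, AtomicLong leaderApplied)
      throws InterruptedException, TimeoutException {
    waitFor(() -> followerApplied.get() >= leaderApplied.get(), 10, 5_000);
    return followerApplied.get();
  }

  public static void main(String[] args) throws Exception {
    // Assumed starting state: the follower is a few applies behind the leader.
    AtomicLong leaderApplied = new AtomicLong(203);
    AtomicLong followerApplied = new AtomicLong(200);

    // Simulates trailing applies still landing on the follower.
    Thread applier = new Thread(() -> {
      while (followerApplied.get() < leaderApplied.get()) {
        try {
          Thread.sleep(20);
        } catch (InterruptedException e) {
          return;
        }
        followerApplied.incrementAndGet();
      }
    });
    applier.start();

    long observed = readAfterCatchUp(followerApplied, leaderApplied);
    applier.join();
    System.out.println("follower applied index after quiescence: " + observed);
  }
}
```

Reading the index only after the wait returns means no trailing apply can move it between the test's read and the later internal read, which a fixed sleep could only approximate.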
Follow-up (separate Jira, to be filed under HDDS-9000): limit the HDDS-14721 class-level parameterization to the tests that actually exercise the checkpoint transfer format, which roughly halves the remaining cost.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10310

## How was this patch tested?

Full `TestOMRatisSnapshots` class run locally with both parameters:

```
mvn -pl hadoop-ozone/integration-test test -Dtest=TestOMRatisSnapshots
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
```

Repeated full-class runs to check for flakiness (a third run is in progress; will update before marking ready for review). Per-test comparison against the baseline shows the affected tests improving consistently, e.g. `testInstallSnapshotWithClientWrite` 50-55s -> 33-39s, `testInstallSnapshotWithClientRead` 44s -> 26-37s, `testInstallOldCheckpointFailure` 33-37s -> 23-28s.

`checkstyle`, `rat`, and `author` checks pass on the module.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
