shangxinli opened a new pull request, #18766:
URL: https://github.com/apache/hudi/pull/18766

   ### Describe the issue this Pull Request addresses
   
   De-flakes the three sibling `TestHudi*FileOperations` test classes in 
`hudi-trino-plugin`. They use a fixed `Thread.sleep(1000L)` to wait for async 
table-stats computation before reading and asserting on OpenTelemetry spans. 
The sleep races with variable stats latency: when the stats computation for 
query N is still running while query N+1 is being measured, that leftover async 
work emits spans that land in the just-reset exporter for query N+1, scrambling 
the expected counts.
   
   **Symptom**: paired tests in `TestHudiNoCacheFileOperations` fail with a 
symmetric off-by-N pattern: `testJoin` is missing exactly the metadata-table 
operations that `testSelectWithFilter` reports as extra. Whether this occurs 
depends on test execution order and stats-computation timing. The same pattern 
can hit the other two sibling classes.
   
   Observed on https://github.com/apache/hudi/pull/18765; the run passed on 
rerun without any code change, which is the classic flake signature.
   
   ### Summary and Changelog
   
   Replaced the fixed sleep with a poll loop that returns once two consecutive 
span snapshots (200ms apart) agree, bounded by a 30-second ceiling. This is the 
deterministic signal that async work has settled — all stats spans for the 
current query are accounted for and nothing new is landing — so the assertion 
sees a stable, complete count.
   
   - `TestHudiNoCacheFileOperations.assertFileSystemAccesses`: replaced 
`Thread.sleep(1000L)` with `waitForStableSpans()`.
   - `TestHudiMemoryCacheFileOperations.assertFileSystemAccesses`: same change.
   - `TestHudiAlluxioCacheFileOperations.assertFileSystemAccesses`: same change.
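   
   The poll loop described above can be sketched roughly as follows. This is 
not the code from this PR: the class `StableSpanWait`, the method 
`waitForStableSnapshot`, and the `Supplier` that stands in for reading the span 
exporter are all hypothetical names chosen for illustration. Only the 
behaviour (return once two consecutive snapshots agree, give up at a ceiling) 
is taken from the description.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

// Hypothetical helper illustrating the "two consecutive snapshots agree" idea.
final class StableSpanWait {
    private StableSpanWait() {}

    // Polls snapshotSource until two consecutive reads are equal, or until
    // timeout elapses. On timeout the last snapshot read is returned, so the
    // caller's assertion fails on stale data instead of hanging.
    static <T> List<T> waitForStableSnapshot(
            Supplier<List<T>> snapshotSource, Duration pollInterval, Duration timeout) {
        long start = System.nanoTime();
        List<T> previous = List.copyOf(snapshotSource.get());
        while (System.nanoTime() - start < timeout.toNanos()) {
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for spans to settle", e);
            }
            List<T> current = List.copyOf(snapshotSource.get());
            if (current.equals(previous)) {
                return current; // nothing new landed during one full interval
            }
            previous = current;
        }
        return previous;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the span exporter: a list that a background thread
        // appends to shortly after the "query" has finished.
        List<String> exporter = new CopyOnWriteArrayList<>();
        exporter.add("query-span");
        Thread lateStats = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                return;
            }
            exporter.add("late-stats-span");
        });
        lateStats.start();

        // 200ms interval and 30s ceiling are the values quoted in the description.
        List<String> settled = waitForStableSnapshot(
                () -> exporter, Duration.ofMillis(200), Duration.ofSeconds(30));
        lateStats.join();
        System.out.println("settled span count: " + settled.size());
    }
}
```

   In a sketch like this, stability is inferred from the absence of change 
over one poll interval, so async work that pauses for longer than the interval 
before emitting more spans would still slip through. The interval is therefore 
a trade-off between test latency and tolerance for bursty stats computation.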
   
   ### Impact
   
   Test-only. No production-code changes.
   
   The poll loop returns in ~400ms on the fast path (one extra read after the 
first stable snapshot), compared to the previous unconditional 1s sleep, so 
the typical case is *faster*, not slower. The 30s ceiling is the worst-case 
bound: if stats genuinely take that long, the loop gives up and the assertion 
runs against whatever was last read, failing rather than hanging.
   
   ### Risk Level
   
   **low** — test-only change, identical fix applied to three sibling classes, 
deterministic root-cause fix rather than a workaround.
   
   Verified: `mvn test-compile` in `hudi-trino-plugin` → BUILD SUCCESS.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

