wombatu-kun opened a new pull request, #19030:
URL: https://github.com/apache/hudi/pull/19030

   ### Describe the issue this Pull Request addresses
   
   `ITTestHoodieDataSource.testStreamReadFromSpecifiedCommitWithChangelog` is 
intermittently failing in CI with a streaming-read row-count mismatch, e.g. 
expecting the 11 merged rows but receiving 9, with the entire `par2` partition 
(`id3`, `id4`) missing. It surfaced on an unrelated PR's `test-flink-1 
(flink2.1, 1.11.4, 1.15.2)` job: 
https://github.com/apache/hudi/actions/runs/27664812470/job/81816514280
   
   These streaming reads are collected with `CollectTableSink` and terminated 
by a forced `SuccessException` once the expected row count is reached, with up 
to a 30s await. `fetchResultWithExpectedNum` tolerates a known benign teardown 
race (an `IOException("Stream is closed")`, or a 
`ParquetColumnarRowSplitReader#readNextRowGroup` NPE) that can fire when the 
source's `SplitFetcher` closes its stream during the job's cascading shutdown. 
That tolerance (added in #19019) assumed the race only fires after the sink has 
already collected its expected rows. Under CI load it can instead fire before 
the read completes, so the job ends with fewer rows than expected; the 
tolerated exception is then swallowed and the test asserts on the incomplete 
result, producing the confusing row-count mismatch.
   
   ### Summary and Changelog
   
   `fetchResultWithExpectedNum` no longer assumes that a tolerated terminal 
failure implies a complete read. The submit-and-collect path is wrapped in a 
new `submitAndFetchWithRetry`, which re-reads (up to `MAX_STREAM_READ_ATTEMPTS 
= 3`) when the collected row count is below the expectation. Re-reading the 
already committed table is idempotent (the collect sink clears its global 
result on each submit), so a transient teardown race no longer leaks into the 
assertion. The swallowed non-`SuccessException` cause is now logged so an 
incomplete read is diagnosable. Behavior is unchanged on the happy path: a 
complete first read returns immediately with no retry.
   
   The change is confined to the test harness in `ITTestHoodieDataSource`; no 
production code is touched.
   
   ### Impact
   
   De-flakes `ITTestHoodieDataSource` streaming-read tests (and any test going 
through `execSelectSqlWithExpectedNum`). No public API, on-disk format, or 
production behavior change.
   
   ### Risk Level
   
   low. Test-only change. The retry is a no-op on the happy path and only 
re-reads an idempotent, already-committed table when a read comes back short. 
Validated locally on JDK 11 / flink1.20: 60 consecutive runs of the failing 
parameterization pass, and a fault-injection run that truncates the first 
attempt (reproducing the exact 9-of-11 CI symptom) confirms the retry recovers 
the full result; the full test method (6 parameterizations) passes with 0 
Checkstyle violations.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to