wombatu-kun opened a new pull request, #19030: URL: https://github.com/apache/hudi/pull/19030
### Describe the issue this Pull Request addresses

`ITTestHoodieDataSource.testStreamReadFromSpecifiedCommitWithChangelog` is intermittently failing in CI with a streaming-read row-count mismatch, e.g. expecting the 11 merged rows but receiving 9, with the entire `par2` partition (`id3`, `id4`) missing. It surfaced on an unrelated PR's `test-flink-1 (flink2.1, 1.11.4, 1.15.2)` job: https://github.com/apache/hudi/actions/runs/27664812470/job/81816514280

These streaming reads are collected with `CollectTableSink` and terminated by a forced `SuccessException` once the expected row count is reached, with up to a 30s await. `fetchResultWithExpectedNum` tolerates a known benign teardown race (an `IOException("Stream is closed")`, or a `ParquetColumnarRowSplitReader#readNextRowGroup` NPE) that can fire when the source's `SplitFetcher` closes its stream during the job's cascading shutdown.

That tolerance (added in #19019) assumed the race only fires after the sink has already collected its expected rows. Under CI load it can instead fire before the read completes, so the job ends with fewer rows than expected; the tolerated exception is then swallowed and the test asserts on the incomplete result, producing the confusing row-count mismatch.

### Summary and Changelog

`fetchResultWithExpectedNum` no longer assumes that a tolerated terminal failure implies a complete read. The submit-and-collect path is wrapped in a new `submitAndFetchWithRetry`, which re-reads (up to `MAX_STREAM_READ_ATTEMPTS = 3`) when the collected row count is below the expectation. Re-reading the already committed table is idempotent (the collect sink clears its global result on each submit), so a transient teardown race no longer leaks into the assertion. The swallowed non-`SuccessException` cause is now logged so an incomplete read is diagnosable. Behavior is unchanged on the happy path: a complete first read returns immediately with no retry.
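
The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names `submitAndFetchWithRetry` and `MAX_STREAM_READ_ATTEMPTS` come from the description, while the `Supplier`-based signature, the `String` row type, the `rows` helper, and the choice to return the last short result rather than throw are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the test harness; not the code from the PR.
final class StreamReadRetrySketch {

  // Attempt cap named in the PR description.
  static final int MAX_STREAM_READ_ATTEMPTS = 3;

  // Re-runs an idempotent read until it yields at least expectedNum rows or
  // the attempt budget is spent. The last result is returned either way, so a
  // persistent shortfall still reaches the caller's assertion (an assumption).
  static List<String> submitAndFetchWithRetry(Supplier<List<String>> submitAndFetch, int expectedNum) {
    List<String> rows = new ArrayList<>();
    for (int attempt = 1; attempt <= MAX_STREAM_READ_ATTEMPTS; attempt++) {
      rows = submitAndFetch.get();
      if (rows.size() >= expectedNum) {
        return rows; // complete read: no further attempts
      }
      // Log the short read so an incomplete attempt is diagnosable.
      System.err.printf("attempt %d/%d collected %d of %d rows%n",
          attempt, MAX_STREAM_READ_ATTEMPTS, rows.size(), expectedNum);
    }
    return rows;
  }

  // Builds n placeholder rows to simulate a collected result.
  static List<String> rows(int n) {
    List<String> out = new ArrayList<>();
    for (int i = 1; i <= n; i++) {
      out.add("row" + i);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // First attempt is truncated to 9 rows; later attempts return all 11.
    Supplier<List<String>> flakyRead = () -> rows(calls[0]++ == 0 ? 9 : 11);
    List<String> result = submitAndFetchWithRetry(flakyRead, 11);
    System.out.println(result.size() + " rows after " + calls[0] + " attempt(s)");
  }
}
```

The simulated read in `main` mimics the 9-of-11 symptom on the first attempt only, so the loop recovers on the second attempt. A read that is complete on the first try exits the loop immediately, which matches the unchanged happy path.
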
The change is confined to the test harness in `ITTestHoodieDataSource`; no production code is touched.

### Impact

De-flakes `ITTestHoodieDataSource` streaming-read tests (and any test going through `execSelectSqlWithExpectedNum`). No public API, on-disk format, or production behavior change.

### Risk Level

low. Test-only change. The retry is a no-op on the happy path and only re-reads an idempotent, already-committed table when a read comes back short.

Validated locally on JDK 11 / flink1.20: 60 consecutive runs of the failing parameterization pass, and a fault-injection run that truncates the first attempt (reproducing the exact 9-of-11 CI symptom) confirms the retry recovers the full result; the full test method (6 parameterizations) passes with 0 Checkstyle violations.

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
