rionmonster opened a new pull request, #2504: URL: https://github.com/apache/fluss/pull/2504
### Purpose

Linked issue: close #2262

Per issue https://github.com/apache/fluss/issues/2262, this pull request addresses an asynchronous time gap in the commit notification flow that can cause the `IcebergRewriteITCase.testLogTableCompaction` test case to fail, particularly during CI builds.

### Brief change log

This change improves the reliability of `IcebergRewriteITCase.testLogTableCompaction` by introducing a two-phase verification process for tiering completion. It adds `waitForIcebergSnapshotOffset()` to first verify the commit at the source of truth (i.e. Iceberg), then uses an updated `assertReplicaStatus()` to wait for the asynchronous replica notification.

#### Verification

After significant exploration of [the previously proposed solution](https://github.com/apache/fluss/pull/2265) and separate out-of-band conversations with @luoyuxia during the investigation, a root cause was found, along with evidence that confirms the time-gap issue.

##### Diagnostics Gist

A reproducible test can be found in [this gist](https://gist.github.com/rionmonster/364ee6639a6aa267210e644bd9c12500), which introduces a `LakeTableOffsetAsyncGapTest` purely for diagnostic purposes.

##### Diagnostics Logs

The gist above produces a series of logs that confirm the timing issue by comparing observed values against expectations:

```
1784 [ForkJoinPool-1-worker-19] INFO org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Cycle 10: Gap observed! Commit took 5 ms, notification delay was 1 ms
1784 [ForkJoinPool-1-worker-19] INFO org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Summary: Gap observed in 10/10 cycles
1785 [ForkJoinPool-1-worker-19] INFO org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Gap timings (ms): min=1, max=3, avg=1.2
...
1928 [ForkJoinPool-1-worker-19] INFO org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Immediate check after commit: lakeLogEndOffset=-1 (expected: -1 if async gap exists, 25 if sync)
1928 [ForkJoinPool-1-worker-19] INFO org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Rapid samples of lakeLogEndOffset: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
```

### Tests

Updates the existing `FlinkIcebergTieringTestBase.assertReplicaStatus` and adds `FlinkIcebergTieringTestBase.waitForIcebergSnapshotOffset`, so that commit and replica status are verified separately, which addresses the underlying flaky test.

### API and Format

N/A

### Documentation

N/A

### Reviewer(s) Requested

@luoyuxia

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
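
For readers without the diff at hand, the two-phase verification described in the change log can be sketched roughly as follows. This is an illustration only, not the code in this pull request: the class `TwoPhaseTieringWait`, the helper `waitUntilAtLeast`, and the two `AtomicLong` stand-ins are hypothetical, and the real `waitForIcebergSnapshotOffset()` and `assertReplicaStatus()` read from Iceberg and from the Fluss replica rather than from in-memory counters. The offset `25` and the initial `-1` mirror the values in the diagnostic logs.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these names come from the Fluss code base.
final class TwoPhaseTieringWait {

    /** Polls {@code current} until it reports at least {@code expected}; fails once the timeout elapses. */
    static long waitUntilAtLeast(String what, LongSupplier current, long expected, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            long value = current.getAsLong();
            if (value >= expected) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError(
                        what + " did not reach " + expected + " in time, last value: " + value);
            }
            try {
                Thread.sleep(10); // short poll interval, chosen arbitrarily for the sketch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + what, e);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the offset recorded in the committed lake snapshot (assumption).
        AtomicLong lakeSnapshotOffset = new AtomicLong(-1);
        // Stand-in for the replica's lakeLogEndOffset, which is updated asynchronously (assumption).
        AtomicLong replicaLakeLogEndOffset = new AtomicLong(-1);

        // Simulate a commit: the lake side is updated first, the replica is notified a bit later.
        lakeSnapshotOffset.set(25);
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            replicaLakeLogEndOffset.set(25);
        });
        notifier.start();

        // Phase 1: verify the commit at the source of truth.
        waitUntilAtLeast("lake snapshot offset", lakeSnapshotOffset::get, 25, Duration.ofSeconds(5));
        // Phase 2: wait for the asynchronous notification to reach the replica.
        waitUntilAtLeast("replica lakeLogEndOffset", replicaLakeLogEndOffset::get, 25, Duration.ofSeconds(5));

        System.out.println("Both phases observed the expected offset");
    }
}
```

Separating the two waits means a timeout names the phase that stalled: a failure in phase 1 points at the commit itself, while a failure in phase 2 points at the notification path.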
