rionmonster opened a new pull request, #2504:
URL: https://github.com/apache/fluss/pull/2504

   ### Purpose
   
   Linked issue: close #2262  
   
   Per issue https://github.com/apache/fluss/issues/2262, this pull request 
addresses a timing gap in the commit notification flow: the Iceberg commit 
completes before the asynchronous notification reaches the replica, which can 
cause `IcebergRewriteITCase.testLogTableCompaction` to fail (particularly 
during CI builds). 
   
   ### Brief change log
   
   This change improves the reliability of 
`IcebergRewriteITCase.testLogTableCompaction` by introducing a two-phase 
verification process for tiering completion. It adds 
`waitForIcebergSnapshotOffset()` to first verify commits at the source of truth 
(i.e. Iceberg), then uses an updated `assertReplicaStatus()` to wait for the 
async replica notification.
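   
The two-phase check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `TwoPhaseTieringCheck`, the helper `waitForOffset`, and the `LongSupplier` parameters are hypothetical stand-ins for whatever reads the Iceberg snapshot offset and the replica's lake log end offset in the real test base.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

// Hypothetical sketch: names and signatures are assumptions, not the Fluss test API.
final class TwoPhaseTieringCheck {

    /** Polls {@code actual} until it is at least {@code expected}, or fails on timeout. */
    static void waitForOffset(String what, LongSupplier actual, long expected, Duration timeout)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (actual.getAsLong() < expected) {
            if (System.nanoTime() - deadline > 0) {
                throw new TimeoutException(
                        what + " did not reach " + expected + ", last seen " + actual.getAsLong());
            }
            Thread.sleep(10); // short back-off between polls
        }
    }

    /** Verifies tiering in two phases: source of truth first, async replica state second. */
    static void awaitTiered(
            LongSupplier icebergSnapshotOffset,
            LongSupplier replicaLakeLogEndOffset,
            long expected,
            Duration timeout)
            throws InterruptedException, TimeoutException {
        // Phase 1: the commit must be visible in the lake (assumed source of truth).
        waitForOffset("Iceberg snapshot offset", icebergSnapshotOffset, expected, timeout);
        // Phase 2: the replica learns about the commit asynchronously, so poll for it too.
        waitForOffset("replica lakeLogEndOffset", replicaLakeLogEndOffset, expected, timeout);
    }
}
```

Splitting the wait this way means a timeout message names the phase that stalled, which should make a CI failure easier to attribute to either the commit or the notification.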
   
   #### Verification
   
   After significant exploration of [the previously proposed 
solution](https://github.com/apache/fluss/pull/2265) and separate 
out-of-band conversations with @luoyuxia during the investigation, a root cause 
was identified, along with evidence confirming the time-gap issue. 
   
   ##### Diagnostics Gist
   A reproducible test is available in [this 
gist](https://gist.github.com/rionmonster/364ee6639a6aa267210e644bd9c12500), 
which introduces a `LakeTableOffsetAsyncGapTest` purely for diagnostic 
purposes.
   
   ##### Diagnostics Logs
   The gist above includes logs that demonstrate the timing gap by comparing 
observed values against expectations:
   ```
   1784 [ForkJoinPool-1-worker-19] INFO  org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Cycle 10: Gap observed! Commit took 5 ms, notification delay was 1 ms
   1784 [ForkJoinPool-1-worker-19] INFO  org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Summary: Gap observed in 10/10 cycles
   1785 [ForkJoinPool-1-worker-19] INFO  org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Gap timings (ms): min=1, max=3, avg=1.2
   ...
   1928 [ForkJoinPool-1-worker-19] INFO  org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Immediate check after commit: lakeLogEndOffset=-1 (expected: -1 if async gap exists, 25 if sync)
   1928 [ForkJoinPool-1-worker-19] INFO  org.apache.fluss.server.replica.LakeTableOffsetAsyncGapTest [] - [TEST] Rapid samples of lakeLogEndOffset: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
   ```
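   
The gap these logs show can be illustrated with a toy model. It is not the gist's `LakeTableOffsetAsyncGapTest`: the class `AsyncGapModel`, the field `lakeSnapshotOffset`, and the latch that holds back the notification are all assumptions made for illustration. Only the `lakeLogEndOffset` name and the values -1 and 25 come from the logs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: a synchronous "commit" followed by an asynchronous "notification".
final class AsyncGapModel {
    // Hypothetical stand-in for the offset recorded in the lake snapshot.
    final AtomicLong lakeSnapshotOffset = new AtomicLong(-1);
    // Stand-in for the replica-side value that the test asserts on.
    final AtomicLong lakeLogEndOffset = new AtomicLong(-1);
    // Holds the notification back so the gap is deterministic in this sketch.
    private final CountDownLatch deliver = new CountDownLatch(1);

    /** Commits synchronously, then updates the replica-side offset on another thread. */
    CompletableFuture<Void> commit(long offset) {
        lakeSnapshotOffset.set(offset); // visible as soon as commit() returns
        return CompletableFuture.runAsync(() -> {
            try {
                deliver.await(); // models the notification delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            lakeLogEndOffset.set(offset); // the replica catches up only now
        });
    }

    /** Lets the pending notification through. */
    void releaseNotification() {
        deliver.countDown();
    }
}
```

In this model, reading `lakeLogEndOffset` right after `commit(25)` returns still yields -1 even though `lakeSnapshotOffset` is already 25, which is the situation an immediate assertion on replica status would trip over.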
   
   ### Tests
   
   Updates the existing `FlinkIcebergTieringTestBase.assertReplicaStatus` and 
adds `FlinkIcebergTieringTestBase.waitForIcebergSnapshotOffset` so that commit 
and replica status are verified separately, addressing the underlying flaky 
test.
   
   ### API and Format
   
   N/A
   
   ### Documentation
   
   N/A
   
   ### Reviewer(s) Requested
   
   @luoyuxia 
   

