rionmonster commented on code in PR #2265:
URL: https://github.com/apache/fluss/pull/2265#discussion_r2678116673
##########
fluss-lake/fluss-lake-iceberg/src/test/java/org/apache/fluss/lake/iceberg/maintenance/IcebergRewriteITCase.java:
##########
@@ -186,6 +186,9 @@ void testLogTableCompaction() throws Exception {
t1, t1Bucket, ++i, true,
Collections.singletonList(row(1, "v1"))));
checkFileStatusInIcebergTable(t1, 3, false);
+ // Ensure tiering job has fully processed the previous writes
+ assertReplicaStatus(t1Bucket, i);
Review Comment:
@luoyuxia
You were right! After adding more logs than I would like to mention and
performing some rituals to the debugging gods, I was able to verify the issue,
which seemed to be centered around the state transitions themselves.
It appears that, when requesting a table,
`LakeTableTieringManager.requestTable()` was relying on the existing pending
queue (`pendingTieringTables`) to serve the appropriate table. Because pending
entries could be duplicated or stale (e.g., timers firing late, retries, races
between state transitions), it was possible to request a table from the
queue that was no longer actually in the "Pending" state. We’d then try to move
it to Tiering, fail the state transition, but still hand it out anyway, which
could confuse the tiering service and lead to flaky behavior under repetition.
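To illustrate the kind of guard this calls for, here is a minimal sketch of a
queue that skips stale or duplicate entries before handing a table out. It is a
simplified, hypothetical model and not the actual `LakeTableTieringManager`
code: the class `TieringQueueSketch`, the `TieringState` enum, `markPending`,
and `stateOf` are all made up for the example; only the names `requestTable`
and `pendingTieringTables` come from the description above.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model; not the real LakeTableTieringManager.
class TieringQueueSketch {
    enum TieringState { PENDING, TIERING }

    private final Queue<Long> pendingTieringTables = new ArrayDeque<>();
    private final Map<Long, TieringState> tieringStates = new HashMap<>();

    /** Marks a table as pending and enqueues it; the queue may hold duplicates. */
    synchronized void markPending(long tableId) {
        tieringStates.put(tableId, TieringState.PENDING);
        pendingTieringTables.add(tableId);
    }

    /** Returns the next table that is really PENDING, or null if there is none. */
    synchronized Long requestTable() {
        Long tableId;
        while ((tableId = pendingTieringTables.poll()) != null) {
            if (tieringStates.get(tableId) == TieringState.PENDING) {
                // Only hand the table out once the transition has succeeded.
                tieringStates.put(tableId, TieringState.TIERING);
                return tableId;
            }
            // Stale or duplicate entry: drop it and keep looking.
        }
        return null;
    }

    synchronized TieringState stateOf(long tableId) {
        return tieringStates.get(tableId);
    }
}
```

With this shape, a duplicate queue entry for a table that is already in
Tiering is silently dropped instead of being handed out a second time.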
Separately, some state transition side effects (like scheduling delayed
tiering) could run asynchronously and read the state of the table before its
new state was recorded, causing valid transitions to be rejected if a timer
fired early.
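The ordering concern can be sketched as "record the new state first, then
trigger side effects". The example below is a hypothetical model, not the
project's code: `TransitionOrderSketch`, its `State` enum, and the
`Consumer<Runnable>` standing in for a timer service are all assumptions made
for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified model of ordering state updates and side effects.
class TransitionOrderSketch {
    enum State { TIERED, SCHEDULED, PENDING }

    private final Map<Long, State> states = new HashMap<>();
    // Stand-in for a timer service: receives the callback to run after the delay.
    private final Consumer<Runnable> timer;

    TransitionOrderSketch(Consumer<Runnable> timer) {
        this.timer = timer;
    }

    synchronized void register(long tableId) {
        states.put(tableId, State.TIERED);
    }

    /** TIERED -> SCHEDULED: the state is recorded before the timer is armed. */
    synchronized boolean toScheduled(long tableId) {
        if (states.get(tableId) != State.TIERED) {
            return false;
        }
        states.put(tableId, State.SCHEDULED);    // 1. record the new state
        timer.accept(() -> toPending(tableId));  // 2. only then schedule the side effect
        return true;
    }

    /** SCHEDULED -> PENDING: invoked by the timer callback. */
    synchronized boolean toPending(long tableId) {
        if (states.get(tableId) != State.SCHEDULED) {
            return false;
        }
        states.put(tableId, State.PENDING);
        return true;
    }

    synchronized State stateOf(long tableId) {
        return states.get(tableId);
    }
}
```

Passing `Runnable::run` as the timer makes the callback fire immediately, the
earliest a timer could possibly fire. Because the state is already recorded by
then, the follow-up transition to `PENDING` is accepted rather than rejected.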
I'm wrapping up verification now, but should have an updated PR coming soon.