yadavay-amzn opened a new pull request, #16324:
URL: https://github.com/apache/iceberg/pull/16324

   ## Problem
   
   Fixes #15487.
   
   When Flink TableMaintenance runs both `ExpireSnapshots` and 
`DeleteOrphanFiles`, manifest list files of live snapshots are incorrectly 
deleted as orphans, causing `NotFoundException` in subsequent `ExpireSnapshots` 
runs.
   
   ## Root cause
   
   `ListMetadataFiles` loads the table once at operator startup (`open()`) and 
never calls `table.refresh()` in `processElement()`. As a result, it only emits 
manifest list and manifest file paths for snapshots that existed when the Flink 
job started.
   
   Any snapshot added after job start has its metadata files missing from the 
"referenced" set that `DeleteOrphanFiles` uses. When those manifest lists are 
older than `minAge`, `OrphanFilesDetector` classifies them as orphans and 
`DeleteFilesProcessor` deletes them.
   
   On the next maintenance cycle, `ExpireSnapshots` tries to read those 
manifest lists in `IncrementalFileCleanup.cleanFiles()` and fails with 
`NotFoundException`.
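
The failure mode above can be illustrated with a toy model. This is only a sketch: `StaleReferenceDemo`, `orphans`, and `manifestLists` are hypothetical names, the file paths are made up, and the `minAge` filter is left out. It is not the Iceberg implementation; it only shows how a "referenced" set built from a stale snapshot list makes a live manifest list look like an orphan.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these names are hypothetical and are not Iceberg classes.
final class StaleReferenceDemo {

  // A file counts as an orphan when it exists in storage but is absent from
  // the "referenced" set. The minAge filter is omitted for brevity.
  static Set<String> orphans(Set<String> filesInStorage, Set<String> referenced) {
    Set<String> result = new TreeSet<>(filesInStorage);
    result.removeAll(referenced);
    return result;
  }

  // Simplification: each snapshot contributes exactly one manifest list path.
  static Set<String> manifestLists(List<Long> snapshotIds) {
    Set<String> paths = new TreeSet<>();
    for (long id : snapshotIds) {
      paths.add("metadata/snap-" + id + ".avro");
    }
    return paths;
  }

  public static void main(String[] args) {
    List<Long> atJobStart = List.of(1L, 2L);   // what a never-refreshed table sees
    List<Long> current = List.of(1L, 2L, 3L);  // snapshot 3 was committed later
    Set<String> storage = manifestLists(current);

    // Stale view: the manifest list of live snapshot 3 is flagged as an orphan.
    System.out.println(orphans(storage, manifestLists(atJobStart))); // [metadata/snap-3.avro]
    // Current view: nothing is flagged.
    System.out.println(orphans(storage, manifestLists(current)));    // []
  }
}
```

In this model, deleting the flagged file corresponds to the step where the manifest list of a live snapshot disappears, which a later read of that snapshot would then fail on.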
   
   This explains why:
   - The bug only occurs with `DeleteOrphanFiles` enabled (it is the one 
incorrectly deleting the files)
   - The bug never occurs with `ExpireSnapshots` alone (it only deletes 
manifest lists of snapshots it has already expired and read)
   - The bug becomes more likely over time (each snapshot added after job 
start adds another unprotected manifest list)
   
   ## Fix
   
   Add `table.refresh()` at the top of `ListMetadataFiles.processElement()`, 
matching what `MetadataTablePlanner` already does. This ensures the 
"referenced" set always reflects the current table state.
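
A minimal sketch of the refresh-per-trigger pattern, using stand-in types rather than the real Flink operator or Iceberg `Table`. `Catalog`, `TableHandle`, and `Lister` are hypothetical, and the `refreshPerTrigger` flag exists only so the two behaviours can be compared side by side; the actual change is just the added `table.refresh()` call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; not the Iceberg or Flink API.
final class RefreshPerTriggerSketch {

  // Models the catalog: its snapshot list keeps growing while the job runs.
  static final class Catalog {
    final List<Long> snapshotIds = new ArrayList<>();
  }

  // Models a loaded table: it only sees what it saw at load or last refresh.
  static final class TableHandle {
    private final Catalog catalog;
    private List<Long> view;

    TableHandle(Catalog catalog) {
      this.catalog = catalog;
      this.view = List.copyOf(catalog.snapshotIds);
    }

    void refresh() {
      view = List.copyOf(catalog.snapshotIds);
    }

    List<Long> snapshots() {
      return view;
    }
  }

  // Models the operator: the table is loaded once, then used on every trigger.
  static final class Lister {
    private final TableHandle table;
    private final boolean refreshPerTrigger;

    Lister(TableHandle table, boolean refreshPerTrigger) {
      this.table = table;
      this.refreshPerTrigger = refreshPerTrigger;
    }

    List<String> processElement() {
      if (refreshPerTrigger) {
        table.refresh(); // the fix: pick up snapshots committed since the last trigger
      }
      List<String> out = new ArrayList<>();
      for (long id : table.snapshots()) {
        out.add("metadata/snap-" + id + ".avro");
      }
      return out;
    }
  }

  public static void main(String[] args) {
    Catalog catalog = new Catalog();
    catalog.snapshotIds.add(1L);
    Lister stale = new Lister(new TableHandle(catalog), false);
    Lister fresh = new Lister(new TableHandle(catalog), true);

    catalog.snapshotIds.add(2L); // committed after "job start"

    System.out.println(stale.processElement()); // [metadata/snap-1.avro]
    System.out.println(fresh.processElement()); // [metadata/snap-1.avro, metadata/snap-2.avro]
  }
}
```

Refreshing on every trigger costs one extra metadata load per maintenance cycle in this model, in exchange for a "referenced" set that never lags behind the table.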
   
   The fix is applied to all Flink versions (v1.20, v2.0, v2.1).
   
   ## Generative AI
   
   Generated-by: Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

