priyankar-stripe opened a new issue, #14583: URL: https://github.com/apache/iceberg/issues/14583
### Apache Iceberg version

1.8.1

### Query engine

Flink

### Please describe the bug 🐞

In https://github.com/apache/iceberg/pull/10523/files, we changed the cleanup logic to stop fetching the latest snapshot from the metastore and instead maintain an in-memory snapshot instance for cleanup operations.

Specifically, what we saw happen was:

1. Initial Commit Attempt: Flink attempts to commit snapshot `<snapshot_id>` to metastore. The commit succeeds on the metastore side, but Flink receives a transient network error and incorrectly marks the commit as failed.
2. Retry with Stale Metadata: RetryingMetaStoreClient retries the commit, but since the table has already been modified, metastore returns a `The table has been modified` error. This triggers a `CommitFailedException` (see https://github.com/apache/iceberg/blob/1.8.x/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L277-L278).
3. SnapshotProducer Retry: SnapshotProducer catches this exception and retries the operation. It reuses the same snapshot ID but generates a new manifest list file: `snap-<snapshot_id>-2-<uuid>.avro` **(note the incremented attempt number)**, different from the already-committed manifest list `snap-<snapshot_id>-1-<uuid>.avro`.
4. No-Op Detection: Since there are no actual changes between these two attempts (same snapshot content), Iceberg detects this as a no-op and skips the commit (see https://github.com/apache/iceberg/blob/1.8.x/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L448-L453).
5. Incorrect Cleanup: The cleanup logic then runs, but it incorrectly assumes `snap-<snapshot_id>-2-<uuid>.avro` is the committed manifest list (since it's the most recent attempt).
It therefore deletes `snap-<snapshot_id>-1-<uuid>.avro` as an "uncommitted" file, thereby corrupting the active snapshot.

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
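
To make the cleanup failure in the reported sequence concrete, here is a minimal toy model in Python. It is only a sketch and does not use any Iceberg code: `manifest_list_name`, `files_to_delete`, the snapshot ID `42`, and the UUID placeholder `c0ffee` are all hypothetical names and values chosen for illustration. The only thing it assumes from the report is that cleanup keeps one "committed" manifest list and deletes the other attempts.

```python
# Toy model of the reported cleanup problem. None of these names are Iceberg APIs;
# they are hypothetical helpers used only to illustrate the sequence in the report.


def manifest_list_name(snapshot_id: int, attempt: int, commit_uuid: str) -> str:
    """Build a manifest list file name in the snap-<snapshot_id>-<attempt>-<uuid>.avro shape."""
    return f"snap-{snapshot_id}-{attempt}-{commit_uuid}.avro"


def files_to_delete(written: list[str], believed_committed: str) -> set[str]:
    """Return every written manifest list except the one believed to be committed."""
    return {name for name in written if name != believed_committed}


# Attempt 1 was committed on the metastore side but reported as failed to the client.
attempt_1 = manifest_list_name(42, 1, "c0ffee")
# Attempt 2 was written by the retry, then skipped as a no-op and never committed.
attempt_2 = manifest_list_name(42, 2, "c0ffee")
written = [attempt_1, attempt_2]

# What the table actually references versus what the in-memory state suggests.
referenced_by_table = attempt_1
latest_in_memory = attempt_2

# Trusting the in-memory (most recent) attempt deletes the live manifest list.
deleted_when_trusting_memory = files_to_delete(written, latest_in_memory)

# Checking against what the table references deletes only the orphaned retry file.
deleted_when_checking_table = files_to_delete(written, referenced_by_table)

print("trusting in-memory state deletes:", sorted(deleted_when_trusting_memory))
print("checking the table deletes:", sorted(deleted_when_checking_table))
```

In this model the only difference between the two outcomes is which file is treated as committed, which matches the observation that the attempt-1 manifest list gets removed while the table still points at it.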
