fcfangcc opened a new pull request, #4269: URL: https://github.com/apache/flink-cdc/pull/4269
# Summary This PR addresses a critical issue in the Flink CDC Iceberg sink where task interruptions or restarts during the two-phase commit (2PC) process could result in duplicate data being committed to Iceberg tables. # Problem In the previous implementation, the Iceberg sink did not track which Flink checkpoints had already been successfully committed to the Iceberg snapshot. If a task failed after a successful Iceberg commit but before Flink could acknowledge the checkpoint completion, or if a commit was retried, the same set of data could be committed again, leading to data inconsistency and duplication. # Solution We have introduced a checkpoint-id tracking mechanism inspired by standard idempotent commit patterns. Key Changes: Idempotent Commits in IcebergCommitter: During the commit phase, the current Flink checkpointId is now stored in the Iceberg snapshot summary under the key > flink-cdc-checkpoint-id Before performing a new commit, IcebergCommitter checks the currentSnapshot of the target table. If the lastCheckpointId in the summary matches the current checkpointId, the commit is skipped with a warning, preventing double-committing the same data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
