fcfangcc opened a new pull request, #4269:
URL: https://github.com/apache/flink-cdc/pull/4269

   # Summary
   This PR addresses a critical issue in the Flink CDC Iceberg sink where task 
interruptions or restarts during the two-phase commit (2PC) process could 
result in duplicate data being committed to Iceberg tables.
   
   # Problem
   In the previous implementation, the Iceberg sink did not track which Flink 
checkpoints had already been successfully committed to the Iceberg snapshot. If 
a task failed after a successful Iceberg commit but before Flink could 
acknowledge the checkpoint completion, or if a commit was retried, the same set 
of data could be committed again, leading to data inconsistency and duplication.
   
   # Solution
   We have introduced a checkpoint-id tracking mechanism inspired by standard 
idempotent commit patterns.
   
   Key Changes:
   Idempotent Commits in IcebergCommitter:
   
   During the commit phase, the current Flink checkpointId is now stored in the 
Iceberg snapshot summary under the key    
   
   > flink-cdc-checkpoint-id
   
   Before performing a new commit, IcebergCommitter checks the currentSnapshot 
of the target table. If the lastCheckpointId in the summary matches the current 
checkpointId, the commit is skipped with a warning, preventing 
double-committing the same data.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to