[
https://issues.apache.org/jira/browse/FLINK-38450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yanquan Lv reassigned FLINK-38450:
----------------------------------
Assignee: Spoorthi Basu
> flink-cdc sink to iceberg table record duplication
> --------------------------------------------------
>
> Key: FLINK-38450
> URL: https://issues.apache.org/jira/browse/FLINK-38450
> Project: Flink
> Issue Type: Bug
> Components: Flink CDC
> Affects Versions: 3.0.0
> Reporter: ChaoFang
> Assignee: Spoorthi Basu
> Priority: Major
> Labels: Flink-CDC, iceberg, pull-request-available
> Fix For: cdc-3.5.0
>
>
> When multiple modifications to the same PrimaryId occur within a single
> checkpoint and are flushed separately, records are written to different file
> batches (data/eqDelete/posDelete files) sharing the same seq_number. During
> reads, the eqDeleteFile fails to correctly filter stale data, resulting in
> duplicate records.
> *Repro Steps* (in 1 checkpoint):
> # Got Record(id=1,name="a") insert event.
> # Got flush event (Triggered from some other change).
> # Record(id=1,name="a") flush to w1-0001.parquet (with its deleteFiles).
> # Got Record(id=1,name="b") update event.
> # A new writer is created.
> # Record(id=1,name="b") flush to w2-0001.parquet (with its deleteFiles).
> # Got checkpoint commit event.
> # Commit the metadata of w1-0001.parquet and w2-0001.parquet (with their
> deleteFiles) to the snapshot.
> Because the second flush uses a new writer, data records cannot be
> distributed to files consistently with the first flush (for example, when
> records are distributed to different files by hash).
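
The repro steps above can be sketched as a small simulation. It is only a model, not Flink CDC or Iceberg code: the DataFile and EqDeleteFile classes, the scan and repro functions, and the file name "w2-eq-delete" are hypothetical, and it assumes the second writer emits an equality delete for id=1 when it handles the update. The one rule it relies on comes from the Iceberg table spec: an equality delete applies only to data files whose data sequence number is strictly lower than the delete's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    seq: int      # data sequence number assigned at commit
    rows: tuple   # (id, name) pairs


@dataclass(frozen=True)
class EqDeleteFile:
    name: str
    seq: int
    ids: frozenset  # primary keys deleted by equality


def scan(data_files, eq_deletes):
    """Return live rows; an equality delete only hits data files
    with a strictly lower sequence number."""
    live = []
    for df in data_files:
        deleted = set()
        for ed in eq_deletes:
            if df.seq < ed.seq:
                deleted |= ed.ids
        live.extend(row for row in df.rows if row[0] not in deleted)
    return live


def repro(same_seq: bool):
    """Model the two flushes from the repro steps."""
    w1_seq = 1
    # Both flushes land in one checkpoint commit, so they share a seq_number.
    w2_seq = 1 if same_seq else 2
    data = [
        DataFile("w1-0001.parquet", w1_seq, ((1, "a"),)),
        DataFile("w2-0001.parquet", w2_seq, ((1, "b"),)),
    ]
    # Assumption: the update of id=1 produces an equality delete in writer 2.
    deletes = [EqDeleteFile("w2-eq-delete", w2_seq, frozenset({1}))]
    return scan(data, deletes)


print(repro(same_seq=True))   # [(1, 'a'), (1, 'b')]
print(repro(same_seq=False))  # [(1, 'b')]
```

With a shared seq_number the stale row (1, "a") survives the scan, which matches the duplication described in the report. The second call only shows what the same scan returns when the sequence numbers differ; it says nothing about how the actual fix is implemented.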
--
This message was sent by Atlassian Jira
(v8.20.10#820010)