[ 
https://issues.apache.org/jira/browse/FLINK-38450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanquan Lv reassigned FLINK-38450:
----------------------------------

    Assignee: Spoorthi Basu

> flink-cdc sink to iceberg table record duplication
> --------------------------------------------------
>
>                 Key: FLINK-38450
>                 URL: https://issues.apache.org/jira/browse/FLINK-38450
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>    Affects Versions: 3.0.0
>            Reporter: ChaoFang
>            Assignee: Spoorthi Basu
>            Priority: Major
>              Labels: Flink-CDC, iceberg, pull-request-available
>             Fix For: cdc-3.5.0
>
>
> When multiple modifications to the same primary key occur within a single 
> checkpoint and are flushed separately, the records are written to different 
> file batches (data/eqDelete/posDelete files) that share the same seq_number. 
> Iceberg applies an equality delete only to data files with a strictly lower 
> sequence number, so during reads the eqDelete file does not filter the stale 
> row, resulting in duplicate records.
> *Repro Steps* (in 1 checkpoint):
>  # Got Record(id=1,name="a") insert event.
>  # Got flush event (triggered by some other change).
>  # Record(id=1,name="a") flushed to w1-0001.parquet (with its delete files).
>  # Got Record(id=1,name="b") update event.
>  # A new writer is created.
>  # Record(id=1,name="b") flushed to w2-0001.parquet (with its delete files).
>  # Got checkpoint commit event.
>  # Commit the metadata of w1-0001.parquet and w2-0001.parquet (with their 
> delete files) to the snapshot.
> Because a new writer is created after the flush, data records cannot be 
> distributed correctly across files (for example, when records are distributed 
> to different files by hash).
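
The repro steps above can be modelled with a small, self-contained Python sketch. It is not Flink CDC or Iceberg code: the classes DataFile and EqDeleteFile and the functions scan, commit_same_seq and commit_separate_seq are hypothetical names introduced only for illustration. The sketch assumes the Iceberg rule that an equality delete applies only to data files whose sequence number is strictly lower than the delete's, and it ignores partitions and position deletes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    seq: int        # data sequence number assigned at commit
    rows: tuple     # tuple of (id, name) records


@dataclass(frozen=True)
class EqDeleteFile:
    name: str
    seq: int
    deleted_ids: frozenset  # primary-key values removed by this delete file


def scan(data_files, eq_deletes):
    """Return the live rows of a table scan.

    An equality delete is applied to a data file only when the data
    file's sequence number is strictly lower than the delete's.
    """
    live = []
    for data_file in data_files:
        dead = set()
        for delete in eq_deletes:
            if data_file.seq < delete.seq:
                dead |= delete.deleted_ids
        live.extend(row for row in data_file.rows if row[0] not in dead)
    return live


def commit_same_seq():
    # Both flushes are committed in one snapshot, so every file gets seq 1.
    w1 = DataFile("w1-0001.parquet", 1, ((1, "a"),))
    w2 = DataFile("w2-0001.parquet", 1, ((1, "b"),))
    w2_delete = EqDeleteFile("w2-eq-delete", 1, frozenset({1}))
    return scan([w1, w2], [w2_delete])


def commit_separate_seq():
    # Hypothetical contrast case: the second flush gets a higher seq.
    w1 = DataFile("w1-0001.parquet", 1, ((1, "a"),))
    w2 = DataFile("w2-0001.parquet", 2, ((1, "b"),))
    w2_delete = EqDeleteFile("w2-eq-delete", 2, frozenset({1}))
    return scan([w1, w2], [w2_delete])


if __name__ == "__main__":
    print("same seq:    ", commit_same_seq())
    print("separate seq:", commit_separate_seq())
```

In this model the same-sequence commit returns both versions of id=1, which matches the duplication described in the report. The separate-sequence case is shown only as a contrast and is not necessarily the fix adopted for this issue.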



--
This message was sent by Atlassian Jira
(v8.20.10#820010)