openinx commented on issue #2610:
URL: https://github.com/apache/iceberg/issues/2610#issuecomment-851952014


   > @openinx Can this be the reason behind duplicate data #2308?
   
Do you mean that you ran compaction while the Flink job was upserting rows into the Apache Iceberg table? Other users in Asia have encountered this issue as well: if they don't run any compaction, they never see duplicated rows, but once they enable the compaction service, duplicated rows appear. Yes, that is indeed a bug (#2308) that we need to fix.
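
For reference, by "compaction" I mean the data file rewrite action. A minimal sketch of triggering it, assuming the Spark-based `Actions` API from Iceberg 0.11 (class and method names may differ in other versions):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;

public class CompactIcebergTable {

  // Rewrites small data files into larger ones. Running this against a table
  // that a Flink job is concurrently upserting into is the scenario where
  // the duplicated rows showed up (#2308).
  public static void compact(Table table) {
    Actions.forTable(table)
        .rewriteDataFiles()
        .execute();
  }
}
```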
   
If you don't run any compaction, then we need to consider other causes, such as replaying duplicated change-log events from the MySQL binlog into Apache Iceberg. For example, at timestamp t1 someone scans all the existing rows from a MySQL table and migrates them into Apache Iceberg with a Flink streaming job, and then starts to migrate the incremental binlog events since t1. If the binlog offset at t1 was not recorded inside a MySQL transaction, they may choose to replay a few extra binlog events (that is, export binlog events from before t1), and in that case the Iceberg table can also end up with duplicated rows.
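
To make the "record the binlog offset in a MySQL transaction" step concrete, here is a minimal JDBC sketch of the usual mysqldump-style sequence; the connection URL and credentials are placeholders, and error handling is omitted:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConsistentSnapshotOffset {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/source_db", "user", "password");
         Statement stmt = conn.createStatement()) {

      // Briefly block writes so the binlog offset and the snapshot
      // describe exactly the same instant (timestamp t1).
      stmt.execute("FLUSH TABLES WITH READ LOCK");

      // Pin a repeatable-read snapshot to this instant.
      stmt.execute("START TRANSACTION WITH CONSISTENT SNAPSHOT");

      // Record the binlog offset that matches the snapshot.
      String binlogFile = null;
      long binlogPos = -1L;
      try (ResultSet rs = stmt.executeQuery("SHOW MASTER STATUS")) {
        if (rs.next()) {
          binlogFile = rs.getString("File");
          binlogPos = rs.getLong("Position");
        }
      }

      // Writes may resume; our snapshot still sees only data up to t1.
      stmt.execute("UNLOCK TABLES");

      System.out.printf("full scan at t1, then replay binlog from %s:%d%n",
          binlogFile, binlogPos);

      // ... SELECT the existing rows inside this transaction and load them
      // into the Iceberg table, then commit ...
      stmt.execute("COMMIT");
    }
  }
}
```

Starting the incremental sync strictly from the recorded offset means no events from before t1 are replayed, which avoids the duplicates described above.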
Because the Apache Iceberg table currently maintains the CDC events exactly as they happened, inserting the same row twice produces two duplicated rows rather than one. This will be resolved once we provide an option on the Flink streaming sink side to treat all INSERT rows as UPSERTs. Please see this [PR](https://github.com/apache/iceberg/pull/1996).
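
As a sketch of what that option looks like on the sink side: the flag eventually shipped as `upsert(true)` on `FlinkSink`'s builder, keyed by the table's equality fields. The CDC stream construction and the table path below are placeholders, and the exact builder methods may differ from what the linked PR proposes:

```java
import java.util.Collections;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class UpsertSinkJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Hypothetical CDC source that emits the MySQL change log as RowData.
    DataStream<RowData> changeLog = buildMySqlCdcStream(env);

    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/target_table");

    FlinkSink.forRowData(changeLog)
        .tableLoader(tableLoader)
        // Key rows by the primary key so a replayed INSERT overwrites the
        // earlier copy instead of producing a second, duplicated row.
        .equalityFieldColumns(Collections.singletonList("id"))
        .upsert(true)
        .append();

    env.execute("mysql-cdc-to-iceberg-upsert");
  }

  private static DataStream<RowData> buildMySqlCdcStream(StreamExecutionEnvironment env) {
    throw new UnsupportedOperationException("left as an exercise; any CDC source works");
  }
}
```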

