openinx commented on issue #2610: URL: https://github.com/apache/iceberg/issues/2610#issuecomment-851952014
> @openinx Can this be the reason behind duplicate data #2308?

Do you mean that you ran compaction while the Flink job was upserting rows into the Apache Iceberg table? Other users in Asia have also hit this issue: without compaction they never see duplicated rows, but once the compaction service is enabled, duplicated rows appear. Yes, that is indeed a bug (#2308) that we need to fix.

If you are not running any compaction, then we need to consider other causes, such as replaying duplicated change log events from the MySQL binlog into Apache Iceberg. For example, at timestamp t1 someone scans all existing rows from a MySQL table and migrates them into Apache Iceberg with a Flink streaming job, then starts migrating the incremental binlog events from timestamp t1 onward. If the binlog offset at timestamp t1 is not recorded within a MySQL transaction, we may end up replaying slightly more binlog events than needed (that is, exporting binlog events from before timestamp t1), and in that case we may also see duplicated rows in the Apache Iceberg table. This is because the Apache Iceberg table currently maintains CDC events exactly as they happened: if the same row is INSERTed twice, the table will contain two duplicated rows rather than one.

This will be resolved once we provide an option on the Flink streaming sink side to indicate that all inserted rows should be written as UPSERTs. Please see this [PR](https://github.com/apache/iceberg/pull/1996).
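To make the replay scenario concrete, here is a minimal toy model in plain Python. It does not use any Iceberg or Flink API; the function names `apply_as_append` and `apply_as_upsert` and the event tuples are hypothetical and exist only for illustration. It assumes the snapshot at t1 already contains key 2 and that the binlog replay starts slightly before t1, so the INSERT for key 2 is delivered a second time.

```python
from typing import Dict, List, Tuple

# A CDC event in this toy model: (operation, primary_key, value).
Event = Tuple[str, int, str]


def apply_as_append(events: List[Event]) -> List[Tuple[int, str]]:
    """Apply events exactly as they arrive: every INSERT adds a new row."""
    rows: List[Tuple[int, str]] = []
    for op, key, value in events:
        if op == "INSERT":
            rows.append((key, value))
        elif op == "DELETE":
            rows = [row for row in rows if row[0] != key]
    return rows


def apply_as_upsert(events: List[Event]) -> List[Tuple[int, str]]:
    """Treat every INSERT as an UPSERT: at most one row per primary key."""
    rows: Dict[int, str] = {}
    for op, key, value in events:
        if op == "INSERT":
            rows[key] = value  # overwrites any earlier row with the same key
        elif op == "DELETE":
            rows.pop(key, None)
    return sorted(rows.items())


# Full scan taken at t1 (assumed to already contain keys 1 and 2).
snapshot: List[Event] = [("INSERT", 1, "a"), ("INSERT", 2, "b")]
# Binlog replay that starts a bit before t1, so key 2 is seen again.
binlog: List[Event] = [("INSERT", 2, "b"), ("INSERT", 3, "c")]
events = snapshot + binlog

appended = apply_as_append(events)   # key 2 appears twice
upserted = apply_as_upsert(events)   # key 2 appears once
```

With append semantics the replayed INSERT yields a second copy of key 2, while upsert semantics collapse it into a single row, which is the behavior the proposed sink option is meant to provide.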
