xloya edited a comment on pull request #4311: URL: https://github.com/apache/iceberg/pull/4311#issuecomment-1064774168
> Thanks for providing a patch @xloya.
>
> It would help if you could provide more context. Is this something that needs to be special cased for flink `upsert` mode in some way? Modifying `core` and removing tests causes me concern at first sight.
>
> It's possible I missed something on the mailing list or some other discussion, forgive me if so. Is it possible there's an example scenario that can be given to help me understand?

Of course. We have a scenario where data is written to an Iceberg v2 table through Flink CDC, and the table is also queried with predicates on non-primary-key columns. The current implementation in `core` adds a filter that may drop the equality delete files with the latest seq num produced by Flink streaming writes.

Example, with table schema `(id int (primary key), date date)`:

- At seq num=1, Flink writes a record with `id=1, date='2021-01-01'`. This inserts a data record with `id=1, date='2021-01-01'` and an equality delete record with `id=1, date='2021-01-01'`.
- At seq num=2, Flink writes a record with `id=1, date='2022-01-01'` as an update. This inserts a data record with `id=1, date='2022-01-01'` and an equality delete record with `id=1, date='2022-01-01'`.

At this point, a query such as `select * from xxx where date < '2022-01-01'` filters out the equality delete file written at seq num=2 because of the added filter, so that delete is never applied to the data record written at seq num=1.

This is currently the easiest way to fix the problem. If we want to optimize for Flink upsert, I think the writer may need to read the latest existing records for the primary key from the table and write those to the equality delete file, instead of writing the inserted data to the equality delete file.
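The scenario can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Iceberg's planning code: `DataRow`, `EqDeleteFile`, `may_match_date_lt`, and `scan` are hypothetical names, the delete file's `date` bounds stand in for its column metrics, and the model assumes an equality delete applies only to data with a strictly smaller sequence number.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataRow:
    seq: int      # sequence number of the commit that wrote the row
    id: int       # primary key
    date: date


@dataclass(frozen=True)
class EqDeleteFile:
    seq: int        # sequence number of the commit that wrote the delete
    id: int         # equality key carried by the delete record
    min_date: date  # lower bound of `date` in the delete file (toy metrics)
    max_date: date  # upper bound of `date` in the delete file (toy metrics)


def may_match_date_lt(f: EqDeleteFile, bound: date) -> bool:
    """Stats-based check: could any record in the file satisfy date < bound?"""
    return f.min_date < bound


def scan(rows, deletes, bound, prune_deletes):
    """Evaluate `date < bound`, optionally pruning delete files by the same filter."""
    if prune_deletes:
        deletes = [d for d in deletes if may_match_date_lt(d, bound)]
    result = []
    for r in rows:
        if not r.date < bound:
            continue
        # Assumption: a delete applies only to rows with a smaller seq num.
        deleted = any(d.id == r.id and d.seq > r.seq for d in deletes)
        if not deleted:
            result.append(r)
    return result


OLD = DataRow(seq=1, id=1, date=date(2021, 1, 1))
NEW = DataRow(seq=2, id=1, date=date(2022, 1, 1))
ROWS = [OLD, NEW]
DELETES = [
    EqDeleteFile(seq=1, id=1, min_date=date(2021, 1, 1), max_date=date(2021, 1, 1)),
    EqDeleteFile(seq=2, id=1, min_date=date(2022, 1, 1), max_date=date(2022, 1, 1)),
]
BOUND = date(2022, 1, 1)

with_pruning = scan(ROWS, DELETES, BOUND, prune_deletes=True)
without_pruning = scan(ROWS, DELETES, BOUND, prune_deletes=False)
```

In this model, `with_pruning` still contains the stale row from seq num=1, because the seq num=2 delete file was pruned by the `date` filter, while `without_pruning` is empty.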
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
