xloya edited a comment on pull request #4311:
URL: https://github.com/apache/iceberg/pull/4311#issuecomment-1064774168


   > Thanks for providing a patch @xloya.
   > 
   > It would help if you could provide more context. Is this something that 
needs to be special cased for flink `upsert` mode in some way? Modifying `core` 
and removing tests causes me concern at first sight.
   > 
   > It's possible I missed something on the mailing list or some other 
discussion, forgive me if so. Is it possible there's an example scenario that 
can be given to help me understand?
   
   Of course. We have a scenario in which data is written to an Iceberg v2 
table through Flink CDC, and our users also run queries that filter on 
non-primary-key columns. The current implementation in `core` adds a filter 
that can drop the equality delete file with the latest seq num when Flink is 
writing in streaming mode.  
   For example:  
   Table schema: (id int (primary key), date date)  
   At seq num=1, Flink writes a record with `id=1, date='2021-01-01'`. This 
inserts a data record with `id=1, date='2021-01-01'` and an equality delete 
record with `id=1, date='2021-01-01'`;  
   At seq num=2, Flink writes a record with `id=1, date='2022-01-01'` as an 
update. This inserts a data record with `id=1, date='2022-01-01'` and an 
equality delete record with `id=1, date='2022-01-01'`;  
   If we then query with `select * from xxx where date < '2022-01-01'`, the 
added filter removes the equality delete file written at seq num=2, so the old 
record from seq num=1 is no longer deleted and shows up in the result.  
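
To make the example concrete, here is a rough Python sketch of the scenario. It is a simplified model, not Iceberg's actual scan planning: `FileEntry`, `scan`, and the `prune_deletes` flag are names invented for illustration, and pruning is approximated by comparing each file's minimum `date` with the query bound.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileEntry:
    kind: str    # "data" or "eq_delete"
    seq: int     # sequence number of the commit that added the file
    rows: tuple  # (id, date) records stored in the file

    def min_date(self):
        return min(d for _, d in self.rows)


def scan(files, before, prune_deletes):
    """Return live rows with date < before (simplified, hypothetical model)."""
    def may_match(f):
        # stats-style pruning: keep the file only if some row could match
        return f.min_date() < before

    data = [f for f in files if f.kind == "data" and may_match(f)]
    deletes = [f for f in files if f.kind == "eq_delete"
               and (not prune_deletes or may_match(f))]
    out = []
    for f in data:
        # an equality delete applies only to data files with a lower seq num
        deleted = {i for d in deletes if d.seq > f.seq for i, _ in d.rows}
        out += [r for r in f.rows if r[0] not in deleted and r[1] < before]
    return out


old, new = (1, date(2021, 1, 1)), (1, date(2022, 1, 1))
files = [FileEntry("data", 1, (old,)), FileEntry("eq_delete", 1, (old,)),
         FileEntry("data", 2, (new,)), FileEntry("eq_delete", 2, (new,))]
cutoff = date(2022, 1, 1)

with_pruning = scan(files, cutoff, prune_deletes=True)
without_pruning = scan(files, cutoff, prune_deletes=False)
print(with_pruning)     # the stale seq num=1 row is returned
print(without_pruning)  # empty: the seq num=2 delete removes the old row
```

In this model, applying the query filter to delete files prunes the seq num=2 equality delete, so the outdated `id=1` row comes back; keeping all equality deletes returns nothing, which is the expected result.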
   
   This is currently the easiest way to fix the problem. If we want to optimize 
for Flink upsert, then I think we may need to read the latest record for each 
primary key that already exists in the table and write that record to the 
equality delete file, instead of writing the inserted data to the equality 
delete file.
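
As a sketch of that alternative (only an illustration, not a proposed implementation): the helper names below are hypothetical, the dict lookup stands in for reading the existing row from the table, and pruning is again modeled as a simple comparison on the delete record's `date`.

```python
from datetime import date


def delete_record_for_upsert(existing, key, new_row, use_old_values):
    """Pick the row written to the equality delete file (hypothetical helper)."""
    if use_old_values:
        # alternative: look up the row currently stored for this key;
        # None means the key is new and there is nothing to delete
        return existing.get(key)
    # behavior described in the example: write the inserted row itself
    return new_row


def survives_filter(delete_row, before):
    # a single-row delete file has min == max == the row's date
    return delete_row is not None and delete_row[1] < before


table = {1: (1, date(2021, 1, 1))}  # state after seq num=1
new_row = (1, date(2022, 1, 1))     # update arriving at seq num=2
cutoff = date(2022, 1, 1)           # query: date < '2022-01-01'

current = delete_record_for_upsert(table, 1, new_row, use_old_values=False)
proposed = delete_record_for_upsert(table, 1, new_row, use_old_values=True)
print(survives_filter(current, cutoff))
print(survives_filter(proposed, cutoff))
```

Because the delete record would carry the old row's values, its `date` falls in the same range as the data it removes, so a filter that keeps the old data file also keeps the delete. The cost is an extra read per key in the streaming writer.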


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


