[GitHub] [iceberg] hililiwei commented on pull request #4316: Core:Remove unnecessary row filtering in deleted manifest file

GitBox Sun, 13 Mar 2022 06:07:41 -0700


hililiwei commented on pull request #4316:
URL: https://github.com/apache/iceberg/pull/4316#issuecomment-1066097462



   
   We seem to have the same issue as @xloya. Here are some comments from 
https://github.com/apache/iceberg/pull/4311
   
   ---------------------------------
    
   > Of course. We have a scenario where we write data to an Iceberg v2 table 
through Flink CDC, and the table is queried with filters on non-primary-key 
columns. The current implementation in `core` adds a filter that may drop the 
equality delete files with the latest seq num written by Flink streaming. For 
example, take the table schema `(id int (primary key), date date)`. At seq 
num=1, Flink writes a record with `id=1, date='2021-01-01'`, which inserts a 
data record with `id=1, date='2021-01-01'` and an equality delete record with 
`id=1, date='2021-01-01'`. At seq num=2, it writes a record with `id=1, 
date='2022-01-01'` as an update, which inserts a data record with `id=1, 
date='2022-01-01'` and an equality delete record with `id=1, 
date='2022-01-01'`. A query such as `select * from xxx where date < 
'2022-01-01'` then filters out the equality delete file written at seq num=2 
because of the added filter.
   > 
   > This is currently the easiest way to fix the problem. If we want to 
optimize for Flink upsert, I think we may need to read, at write time, the 
latest records whose primary keys already exist in the table and write those 
to the equality delete file, instead of writing the inserted data to the 
equality delete file.
   
   https://github.com/apache/iceberg/pull/4311#issuecomment-1064774168
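
To make the scenario quoted above concrete, here is a minimal Python sketch. It is a toy model, not Iceberg code: `FileEntry`, `may_match_date_lt`, and `scan` are hypothetical names, and the rule that an equality delete applies only to data files with a strictly lower sequence number is an assumption of the sketch. It only illustrates how pruning delete files by the `date` bounds can let a stale row through.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileEntry:
    """Toy stand-in for a manifest entry (hypothetical, not an Iceberg class)."""
    kind: str    # "data" or "eq_delete"
    seq: int     # sequence number of the commit that wrote the file
    rows: tuple  # rows as (id, date) pairs

    def date_bounds(self):
        # Mimics the lower/upper bound metrics kept for the date column.
        dates = [row[1] for row in self.rows]
        return min(dates), max(dates)


def may_match_date_lt(entry, cutoff):
    # Metrics check for `date < cutoff`: a file can only contain a match
    # if its lower bound is below the cutoff.
    lower, _ = entry.date_bounds()
    return lower < cutoff


def scan(files, cutoff, filter_deletes_on_date):
    """Evaluate `select * where date < cutoff` over the toy files."""
    data = [f for f in files
            if f.kind == "data" and may_match_date_lt(f, cutoff)]
    deletes = [f for f in files
               if f.kind == "eq_delete"
               and (not filter_deletes_on_date or may_match_date_lt(f, cutoff))]
    result = []
    for data_file in data:
        for row in data_file.rows:
            if not row[1] < cutoff:
                continue
            # Assumption: an equality delete on `id` removes matching rows
            # from data files with a strictly lower sequence number.
            deleted = any(
                d.seq > data_file.seq and any(row[0] == dr[0] for dr in d.rows)
                for d in deletes
            )
            if not deleted:
                result.append(row)
    return result


OLD = (1, date(2021, 1, 1))
NEW = (1, date(2022, 1, 1))
FILES = [
    FileEntry("data", 1, (OLD,)),
    FileEntry("eq_delete", 1, (OLD,)),
    FileEntry("data", 2, (NEW,)),
    FileEntry("eq_delete", 2, (NEW,)),
]
CUTOFF = date(2022, 1, 1)

stale = scan(FILES, CUTOFF, filter_deletes_on_date=True)
correct = scan(FILES, CUTOFF, filter_deletes_on_date=False)
```

In this model, `stale` still contains the old `id=1, date='2021-01-01'` row because the seq num=2 delete file was pruned by its `date` bounds, while `correct` is empty.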
   
   ---------------------------------
   > We seem to have the same issue.
   > 
   > It only happens on our Parquet tables (it doesn't happen on our Avro 
tables). After analysis, we found that the problem occurs in the metrics 
(such as `upper_bounds` / `lower_bounds`) filtering of the manifest file 
(Avro tables do not generate these metrics).
   > 
   > Our solution is different from this PR. We trim the row filter fields 
used for metrics filtering: for delete manifest files, filtering is applied 
only on the metrics of fields whose IDs are in `equality_ids`. Please refer 
to #4316 for details.
   > 
   > I'm not sure which way is better, or whether there is another, better 
solution.
   > 
   > I'm sorry if your PR addresses a different issue.
   > 
   > Thx. 😄
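
The trimming idea described in the quoted comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in #4316: `Predicate`, `DeleteFile`, `trim_filter`, and `keep_delete_file` are hypothetical names, the row filter is modeled as a plain AND of single-column predicates, and field IDs 1 and 2 stand for `id` and `date`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Predicate:
    """One conjunct of a row filter (hypothetical toy type)."""
    field_id: int
    op: str        # "lt" or "eq"
    value: object

    def may_match(self, lower, upper):
        # Bounds-based check: can any value in [lower, upper] satisfy this?
        if self.op == "lt":
            return lower < self.value
        if self.op == "eq":
            return lower <= self.value <= upper
        raise ValueError(f"unsupported op: {self.op}")


@dataclass(frozen=True)
class DeleteFile:
    """Toy equality delete file with per-field lower/upper bounds."""
    equality_ids: frozenset
    bounds: dict   # field_id -> (lower, upper)


def trim_filter(predicates, equality_ids):
    # Keep only the conjuncts on fields that are part of the equality key.
    return [p for p in predicates if p.field_id in equality_ids]


def keep_delete_file(delete_file, predicates, trim=True):
    """Return False only if the metrics prove the file cannot be relevant."""
    if trim:
        predicates = trim_filter(predicates, delete_file.equality_ids)
    return all(
        p.may_match(*delete_file.bounds[p.field_id])
        for p in predicates
        if p.field_id in delete_file.bounds
    )


# Assumed field IDs: 1 = id (equality key), 2 = date.
SEQ2_DELETE = DeleteFile(
    equality_ids=frozenset({1}),
    bounds={1: (1, 1), 2: (date(2022, 1, 1), date(2022, 1, 1))},
)
DATE_LT = Predicate(2, "lt", date(2022, 1, 1))
ID_EQ_5 = Predicate(1, "eq", 5)

untrimmed = keep_delete_file(SEQ2_DELETE, [DATE_LT], trim=False)
trimmed = keep_delete_file(SEQ2_DELETE, [DATE_LT], trim=True)
pruned_by_key = keep_delete_file(SEQ2_DELETE, [ID_EQ_5], trim=True)
```

With trimming, the `date` predicate no longer prunes the delete file, while a predicate on the equality key itself (`id = 5` here) can still prune it. Dropping conjuncts is only safe for a pure AND filter; filters containing OR or NOT would need more care.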
   
   ---------------------------------

