hililiwei commented on pull request #4316: URL: https://github.com/apache/iceberg/pull/4316#issuecomment-1066097462
We seem to have the same issue as @xloya. Here are some comments from https://github.com/apache/iceberg/pull/4311

---------------------------------

> Of course, we have a scenario that writes data to Iceberg's v2 table through Flink CDC, and it includes queries on non-primary-key columns. The current implementation in `core` adds a filter, which may drop the equality delete files with the latest seq num for Flink streaming writes. E.g.:
>
> Table schema: (id int (primary key), date date)
>
> When seq num=1, Flink writes a record with `id=1, date='2021-01-01'`. This inserts a data record with `id=1, date='2021-01-01'` and an equality delete record with `id=1, date='2021-01-01'`.
>
> When seq num=2, Flink writes a record with `id=1, date='2022-01-01'` as an update. This inserts a data record with `id=1, date='2022-01-01'` and an equality delete record with `id=1, date='2022-01-01'`.
>
> At this point, when querying with `select * from xxx where date < '2022-01-01'`, the added filter causes the equality delete file written at seq num=2 to be filtered out.
>
> This is currently the easiest way to fix the problem. If we want to optimize for Flink upsert, then I think we may need to read the latest records whose primary key already exists in the table and write those to the equality delete file, instead of writing the inserted data to the equality delete file.

https://github.com/apache/iceberg/pull/4311#issuecomment-1064774168

---------------------------------

> We seem to have the same issue.
>
> It only happens on our Parquet table (it doesn't happen on our Avro table). After analysis, we found that the problem occurs in the metric filtering (such as `upper_bounds` / `lower_bounds`) of the manifest file (Avro tables did not generate these metrics).
>
> Our solution is different from this PR. We try to trim the row filter fields used for metric filtering. For delete manifest files, only the metric field IDs in `equality_ids` are filtered on; please refer to #4316 for details.
>
> I'm not sure which way is better, or whether there is another, better solution.
>
> I'm sorry if your PR addresses a different issue.
>
> Thx. 😄

---------------------------------
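To make the scenario above concrete, here is a minimal sketch in plain Python. It is not Iceberg's API: `DeleteFile`, `LessThan`, `might_match`, `keep_with_full_filter`, and `keep_with_trimmed_filter` are hypothetical names used only for illustration, and the bounds check is a simplified stand-in for the manifest metric filtering. It assumes each equality delete file carries `lower_bounds` / `upper_bounds` for every column it stores, as in the Parquet case described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DeleteFile:
    """Toy stand-in for an equality delete file entry (hypothetical, not Iceberg's class)."""
    seq_num: int
    equality_ids: frozenset  # columns used for equality matching (named here for readability)
    lower_bounds: dict       # column -> minimum value stored in the file
    upper_bounds: dict       # column -> maximum value stored in the file


@dataclass(frozen=True)
class LessThan:
    """Toy row-filter predicate: column < value."""
    column: str
    value: object


def might_match(f, pred):
    """Metric check: could any row in the file satisfy `column < value`?"""
    lower = f.lower_bounds.get(pred.column)
    # Without metrics (the Avro case above) the file can never be pruned.
    return lower is None or lower < pred.value


def keep_with_full_filter(f, preds):
    """Apply the whole row filter to the delete file's metrics."""
    return all(might_match(f, p) for p in preds)


def keep_with_trimmed_filter(f, preds):
    """Apply only the predicates on columns listed in equality_ids."""
    relevant = [p for p in preds if p.column in f.equality_ids]
    return all(might_match(f, p) for p in relevant)


def delete_file(seq_num, row):
    """Build a single-row delete file whose bounds equal the row's values."""
    return DeleteFile(seq_num, frozenset({"id"}), dict(row), dict(row))


deletes = [
    delete_file(1, {"id": 1, "date": date(2021, 1, 1)}),
    delete_file(2, {"id": 1, "date": date(2022, 1, 1)}),
]

# select * from xxx where date < '2022-01-01'
row_filter = [LessThan("date", date(2022, 1, 1))]

full = [f.seq_num for f in deletes if keep_with_full_filter(f, row_filter)]
trimmed = [f.seq_num for f in deletes if keep_with_trimmed_filter(f, row_filter)]

print(full)     # seq num=2 delete file is pruned by the date predicate
print(trimmed)  # both delete files are kept
```

With the full filter, the seq num=2 delete file is pruned because its `date` lower bound is not below `2022-01-01`, so the older data record with `id=1` would no longer be matched by it. Trimming the filter to the `equality_ids` columns keeps that delete file, at the cost of pruning fewer delete files overall.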
