Thanks, Ryan, that’s helpful. I’m curious – is there a Flink-native means to write delete files, then? I’ve written mine through the native Java API, e.g. by creating an EqualityDeleteWriter, and applied them to the ‘main’ Table in a transaction.
ah

From: Ryan Blue <b...@tabular.io>
Sent: Monday, June 6, 2022 1:52 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Positional Deletes in Java/Flink

Andreas,

We haven't built a strategy in the Flink sink that will use position deletes. The difficulty is that position deletes require knowing where the row that you want to delete is located, which means you either have to have expensive row-level indexing or you need to scan through potential data files to locate the rows to delete. Instead, the approach we've taken in Flink so far is to write out equality deletes that don't require knowing where the affected rows are located. Then, you can compact those deletes in the background to make access more efficient.

Another alternative is to use Spark, which has MERGE, UPDATE, and DELETE plans. Those plans already need to find the affected rows, so there are plans that use position deletes (as well as plans that eagerly rewrite, or use a "copy-on-write" strategy). You can use those plans in microbatch to produce the results you're looking for.

If you want to use position deletes, I'd recommend testing this out first to ensure that you get the performance you're looking for. It might be that fast deletes in Flink with an aggressive background compaction policy to apply them is better in the long term.

Ryan

On Mon, Jun 6, 2022 at 5:54 AM Hailu, Andreas <andreas.ha...@gs.com> wrote:

Hi folks,

I’m processing data from an Iceberg table with Flink and had a question about positional deletes. I batch process a source Table to create DataStream of Records that I’d like to delete from it. I initially created equality delete files with all the values from these Records, but for performance purposes I’d like to try out positional deletes.

Given a Record from a Table, how can I go about identifying its position? A fellow from the Slack mentioned in Spark there’s a “_pos” metadata field but I haven’t found the equivalent in Java or Flink.
best,
ah

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices

--
Ryan Blue
Tabular
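
To make the trade-off in this thread concrete, here is a minimal sketch in plain Java. It is a toy model only, not the Iceberg or Flink API: data files are modeled as an ordered map from a hypothetical file path to a list of row ids, and the class and method names (DeleteSketch, equalityDeletes, positionDeletes, and so on) are invented for illustration. It also ignores sequence numbers, partitioning, and file formats. It shows only that an equality delete can be produced from the values alone, while a position delete can only be produced after scanning candidate files to find each row's ordinal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteSketch {

    // Equality delete (toy): the writer records only the values to match.
    // No data file is read at write time.
    static Set<Long> equalityDeletes(List<Long> idsToDelete) {
        return new HashSet<>(idsToDelete);
    }

    // Position delete (toy): the writer must scan every candidate file to
    // learn (file path, row ordinal) for each row it wants to remove.
    static Map<String, Set<Integer>> positionDeletes(
            Map<String, List<Long>> files, Set<Long> idsToDelete) {
        Map<String, Set<Integer>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> file : files.entrySet()) {
            List<Long> rows = file.getValue();
            for (int pos = 0; pos < rows.size(); pos++) {
                if (idsToDelete.contains(rows.get(pos))) {
                    out.computeIfAbsent(file.getKey(), k -> new HashSet<>()).add(pos);
                }
            }
        }
        return out;
    }

    // Reading with equality deletes: every row value is checked against the delete set.
    static List<Long> readWithEqualityDeletes(
            Map<String, List<Long>> files, Set<Long> deleted) {
        List<Long> live = new ArrayList<>();
        for (List<Long> rows : files.values()) {
            for (Long id : rows) {
                if (!deleted.contains(id)) {
                    live.add(id);
                }
            }
        }
        return live;
    }

    // Reading with position deletes: rows are skipped by ordinal, without comparing values.
    static List<Long> readWithPositionDeletes(
            Map<String, List<Long>> files, Map<String, Set<Integer>> deletes) {
        List<Long> live = new ArrayList<>();
        for (Map.Entry<String, List<Long>> file : files.entrySet()) {
            Set<Integer> skip =
                    deletes.getOrDefault(file.getKey(), Collections.<Integer>emptySet());
            List<Long> rows = file.getValue();
            for (int pos = 0; pos < rows.size(); pos++) {
                if (!skip.contains(pos)) {
                    live.add(rows.get(pos));
                }
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Hypothetical file names and ids, for illustration only.
        Map<String, List<Long>> files = new LinkedHashMap<>();
        files.put("data-0.parquet", Arrays.asList(1L, 2L, 3L));
        files.put("data-1.parquet", Arrays.asList(4L, 5L));

        Set<Long> eq = equalityDeletes(Arrays.asList(2L, 5L));
        Map<String, Set<Integer>> pos = positionDeletes(files, eq);

        System.out.println(readWithEqualityDeletes(files, eq));
        System.out.println(readWithPositionDeletes(files, pos));
    }
}
```

In this toy model both reads return the same surviving rows. The difference is where the work happens: the equality path pays on every read until the deletes are compacted, and the position path pays once at write time by scanning the files.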