Thanks, Ryan, that’s helpful.

I’m curious – is there a Flink-native means to write delete files, then? I’ve 
written mine through the native Java API, e.g. by creating an 
EqualityDeleteWriter, and applied them to the ‘main’ Table in a transaction.

ah

From: Ryan Blue <b...@tabular.io>
Sent: Monday, June 6, 2022 1:52 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Positional Deletes in Java/Flink

Andreas,

We haven't built a strategy in the Flink sink that will use position deletes. 
The difficulty is that position deletes require knowing where the row that you 
want to delete is located, which means you either have to have expensive 
row-level indexing or you need to scan through potential data files to locate 
the rows to delete. Instead, the approach we've taken in Flink so far is to 
write out equality deletes that don't require knowing where the affected rows 
are located. Then, you can compact those deletes in the background to make 
access more efficient.
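
To make the trade-off above concrete, here is a minimal toy sketch in plain Java. It does not use the Iceberg API: the class DeleteSketch, the type PositionDelete, and both methods are hypothetical names made up for illustration, and each data file is modelled as an in-memory list of row ids. It only shows why an equality delete can be written without reading any data, while a position delete needs a scan to find each (file, position) pair.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Iceberg classes.
public class DeleteSketch {

    // A position delete names a data file and a row ordinal within that file.
    static final class PositionDelete {
        final String file;
        final long pos;

        PositionDelete(String file, long pos) {
            this.file = file;
            this.pos = pos;
        }

        @Override
        public String toString() {
            return file + "@" + pos;
        }
    }

    // Equality delete: the writer records only the key values.
    // No data file is read at write time.
    static List<Long> equalityDeletes(List<Long> idsToDelete) {
        return new ArrayList<>(idsToDelete);
    }

    // Position delete: the writer has to scan every candidate data file
    // to learn where the matching rows are located.
    static List<PositionDelete> positionDeletes(
            Map<String, List<Long>> dataFiles, List<Long> idsToDelete) {
        List<PositionDelete> out = new ArrayList<>();
        for (Map.Entry<String, List<Long>> file : dataFiles.entrySet()) {
            List<Long> rows = file.getValue();
            for (int pos = 0; pos < rows.size(); pos++) {
                if (idsToDelete.contains(rows.get(pos))) {
                    out.add(new PositionDelete(file.getKey(), pos));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical data files, each a list of row ids.
        Map<String, List<Long>> dataFiles = new LinkedHashMap<>();
        dataFiles.put("data-0.parquet", List.of(10L, 11L, 12L));
        dataFiles.put("data-1.parquet", List.of(13L, 11L));

        List<Long> toDelete = List.of(11L);
        System.out.println("equality deletes: " + equalityDeletes(toDelete));
        System.out.println("position deletes: " + positionDeletes(dataFiles, toDelete));
    }
}
```

In this model the cost of the scan moves from the reader (who must match every row against the equality deletes) to the writer (who must locate rows up front), which is the reason background compaction is suggested for the equality-delete path.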

Another alternative is to use Spark, which has MERGE, UPDATE, and DELETE plans. 
Those plans already need to find the affected rows, so there are plans that use 
position deletes (as well as plans that eagerly rewrite, or use a 
"copy-on-write" strategy). You can use those plans in microbatch to produce the 
results you're looking for. If you want to use position deletes, I'd recommend 
testing this out first to ensure that you get the performance you're looking 
for. It might be that fast deletes in Flink, with an aggressive background 
compaction policy to apply them, are better in the long term.

Ryan

On Mon, Jun 6, 2022 at 5:54 AM Hailu, Andreas 
<andreas.ha...@gs.com> wrote:
Hi folks, I’m processing data from an Iceberg table with Flink and had a 
question about positional deletes.

I batch process a source Table to create a DataStream of Records that I’d like 
to delete from it. I initially created equality delete files with all the 
values from these Records, but for performance purposes I’d like to try out 
positional deletes. Given a Record from a Table, how can I go about identifying 
its position? Someone on Slack mentioned that Spark has a “_pos” metadata 
field, but I haven’t found the equivalent in Java or Flink.

best,
ah


________________________________

Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


--
Ryan Blue
Tabular
