>
> *Upserts/Deletes*
>>
>> I have jobs that apply upserts/deletes to datasets. My current approach
>> is (see the sketch after the list):
>>
>> 1. calculate the affected partitions (collected in the Driver)
>> 2. load up the previous versions of all of those partitions as a
>> DataFrame
>> 3. apply the upserts/deletes
>> 4. write out the new versions of the affected partitions (containing
>> old data plus/minus the upserts/deletes)
>> 5. update my index file
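>>
>> For concreteness, a rough PySpark sketch of steps 1-4 (assuming an
>> active `spark` session; the inputs `upserts` and `delete_keys`, the
>> `key` and `partition_col` columns, and the paths are all illustrative):
>>
>> from pyspark.sql import functions as F
>>
>> # step 1: affected partitions, collected on the driver
>> parts = [r.partition_col
>>          for r in upserts.select('partition_col')
>>                          .union(delete_keys.select('partition_col'))
>>                          .distinct()
>>                          .collect()]
>>
>> # step 2: load the previous versions of only those partitions
>> previous = (spark.read.parquet(path)
>>             .where(F.col('partition_col').isin(parts)))
>>
>> # step 3: drop rows for upserted/deleted keys, then add the new rows
>> merged = (previous
>>           .join(upserts.select('key').union(delete_keys.select('key')),
>>                 on='key', how='left_anti')
>>           .unionByName(upserts))
>>
>> # step 4: write new versions of the affected partitions
>> (merged.write
>>        .mode('overwrite')
>>        .partitionBy('partition_col')
>>        .parquet(new_path))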
>>
>> How is this intended to be done in Iceberg? I see that there are a bunch
>> of Table operations. Would it be up to me to still do steps 1-4 and then
>> rely on Iceberg to do step 5 using the table operations?
>>
>
> Currently, you can delete data by reading, filtering, and overwriting what
> you read. That's an atomic operation so it is safe to read and overwrite
> the same data.
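>
> For example, deleting rows from one partition might look like this (a
> minimal sketch assuming Spark 3's DataFrameWriterV2 against an Iceberg
> table; the table, column, and values are illustrative):
>
> from pyspark.sql import functions as F
>
> # read the affected partition and filter out the rows to delete
> part = spark.table('db.events').where("event_date = '2021-01-01'")
> kept = part.where(F.col('id') != 42)
>
> # atomically replace exactly the rows that were read
> kept.writeTo('db.events').overwrite(F.col('event_date') == '2021-01-01')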
>
Does such an operation take advantage of partitioning to minimize write
amplification? For example, let's say I do something like this:
path = 'some_path'
df = spark.read.format('iceberg').load(path)
df = (
    df
    # drop every row in the partitions named by the keys
    .join(keys_to_delete, on=['partition_col'], how='left_anti')
    # add the new rows back in
    .union(upserts)
)
df.write.format('iceberg').mode('overwrite').save(path)
Is this going to result in scanning and rewriting the entire dataset even
if the keys to delete are in a small subset of partitions? I would imagine
so. What is the correct way to do this?