Re: Changing default delete file granularity for Spark writes from partition to file scoped

Amogh Jahagirdar Tue, 26 Nov 2024 14:55:12 -0800

Just following up on this thread,

Getting numbers for various table layouts is involved and I think it would
instead be helpful to look at a degenerate read amplification case arising
from partition-scoped deletes.
Furthermore, there are some additional benefits to file scope deletes that
are worth clarifying that benchmarking alone won't capture.



*Degenerate Read Amplification Case for Partition Scoped Deletes*
Consider a table with id and data columns, partitioned with bucket(10, id)
and 10 data files per bucket. With partition scoped deletes, if a delete
affects only two files within a bucket, any read touching the 8 unaffected
files in a partition will unnecessarily fetch the delete file. Generalized:
Say there are D data files per partition, W writes performed, and I
impacted data files for deletes, where I << D

   - Partition-scoped: O(D * W) delete file reads in worst case for a given
   partition because each write will produce a partition scoped delete which
   would need to be read for each data file.
   - File-scoped: O(I) delete file reads in worst case for a given partition

The key above is how read amplification with partition scoped deletes can
increase with every write that's performed, and this is further compounded
in aggregate by how many partitions are impacted as well.
The file-scoped deletes that need to be read scale independently of the
number of writes that are performed since they're targeted per data file
and each delete is being maintained on write to make sure there's not
multiple delete files for a given data file.

*Additional benefits to file scoped deletes*:

   - Now that file scoped deletes are being maintained as part of writes,
   old deletes will be removed from storage. With partition scoped deletes,
   the delete file cannot even be removed even if it's replaced for a given
   data file since there can be deletes for other data files.
   - Moving towards file scoped deletes will help avoid unnecessary
   conflicts with concurrent compactions/other writes.
   - There are other engines which already write file scoped deletes, Anton
   mentioned Flink earlier in the thread, which also ties into point 1 above,
   since much of the goal of writing file scoped deletes was to avoid
   unnecessary conflicts with concurrent compactions. Trino
   
<https://github.com/trinodb/trino/blob/332a658f3a292f79da203a131d2e25b2418d5279/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteWriter.java#L73>
   is another engine which writes file scoped position deletes as of today.
   - We know that V3 deletion vectors will be an improvement in the
   majority of cases when compared to existing position deletes. Considering
   this and the fact that file-scoped deletes are closer to DVs as Anton
   mentioned, moving towards file scoped deletes makes it easier to migrate
   from the file-scoped deletes to DVs.

It's worth clarifying that in the case there is some regression for a given
workload it's always possible to set the property back to partition scoped.

Thanks,
Amogh Jahagirdar

On Fri, Nov 15, 2024 at 11:54 AM Amogh Jahagirdar <[email protected]> wrote:

> Following up on this thread,
>
> > I don't think this is a bad idea from a theoretical perspective. Do we
> have any actual numbers to back up the change?
> There are no numbers yet, changing the default is largely driven by the
> fact that the previous downside of file scoped deletes leading to many
> files on disk, is now mitigated by Spark sync maintenance.
>
> To get some more numbers, I think we'll first need to update some of our
> benchmarks to be more representative of typical MoR workloads. For example
> DeleteFileIndexBenchmark currently tests tables with a small amount of
> partitions with many data files and very few delete files, so this leads to
> a huge difference in files between partition scoped and file scoped for
> that benchmark and doesn't display the benefits of moving to file scoped
> deletes by default. I'll look into updating some of these benchmarks.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Nov 12, 2024 at 12:36 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> I support the idea of switching to file-scoped deletes for new tables.
>> The absence of sync maintenance prevented us from doing that earlier. Given
>> that Amogh recently merged that functionality into main, we should be fine.
>> Flink connectors have been using file-scoped deletes for a while now to
>> solve issues with concurrent compactions.
>>
>> File-scoped deletes are closer to DVs and even share some code like
>> assignment by path and sync maintenance in Spark. This migration will allow
>> us to test many concepts that are needed for DVs prior to completing the V3
>> spec. It is also easier to migrate file-scoped deletes to DVs in V3. Unlike
>> partition-scoped deletes, they can be discarded from the table state once a
>> DV is produced for that data file.
>>
>> - Anton
>>
>>
>>
>> пн, 11 лист. 2024 р. о 21:01 Russell Spitzer <[email protected]>
>> пише:
>>
>>> I don't think this is a bad idea from a theoretical perspective. Do we
>>> have any actual numbers to back up the change? I would think for most folks
>>> we would recommend just going to V3 rather than changing granularity for
>>> their new tables. It would just affect new tables though so I'm not opposed
>>> to the change except for on the grounds that we are changing a default.
>>>
>>> On Mon, Nov 11, 2024 at 1:56 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I wanted to discuss changing the default position delete file
>>>> granularity for Spark from partition to file level for any newly created V2
>>>> tables. See this PR [1]
>>>>
>>>> Context on delete file granularity:
>>>>
>>>>    - Partition granularity: Writers group delete files for multiple
>>>>    data files from the same partition into the same delete file. This 
>>>> leads to
>>>>    fewer files on disk, but higher read amplification from reading delete
>>>>    information from irrelevant data files for a scan.
>>>>    - File granularity: Writers write a new delete file for every
>>>>    changed data file. More targeted reads of relevant delete information 
>>>> occur
>>>>    but this can lead to more files on disk.
>>>>
>>>> With the recent merge of synchronous position delete maintenance on
>>>> write in Spark [2], file granularity as a default is more compelling since
>>>> reads would be more targeted *and* files would be maintained on disk.
>>>> I also recommend folks go through the deletion vector design doc for more
>>>> details [3].
>>>>
>>>> Note that for existing tables with high delete-to-data file ratios,
>>>> Iceberg's rewrite position deletes procedure can compact the table and
>>>> every subsequent write would continuously maintain the position deletes.
>>>> Additionally note that in V3, at most one puffin position delete file is
>>>> allowed per data file; what's being discussed here is changing the default
>>>> granularity for new V2 tables since it should generally be better after the
>>>> sync maintenance addition.
>>>>
>>>> What are folks' thoughts on this?
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/11478
>>>> [2] https://github.com/apache/iceberg/pull/11273
>>>> [3]
>>>> https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?tab=t.0#heading=h.193fl7s89tcg
>>>>
>>>> Thanks,
>>>>
>>>> Amogh Jahagirdar
>>>>
>>>

Re: Changing default delete file granularity for Spark writes from partition to file scoped

Reply via email to