I think that makes sense to do, I'll do a review this morning.
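For anyone who wants to try this on a specific table before the default changes, my understanding is that the granularity can already be controlled per table through the write.delete.granularity property that SparkWriteConf picks up. A minimal sketch from spark-shell, with placeholder catalog/table names and assuming a recent enough Iceberg release where the property is supported:

    // Opt a table into file-scoped position deletes; names below are placeholders.
    spark.sql("""
      ALTER TABLE my_catalog.db.events
      SET TBLPROPERTIES ('write.delete.granularity' = 'file')
    """)

    // And if a particular job regresses, set it back to the previous behavior:
    spark.sql("""
      ALTER TABLE my_catalog.db.events
      SET TBLPROPERTIES ('write.delete.granularity' = 'partition')
    """)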
On Mon, Dec 23, 2024 at 9:53 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey all, it's been a while but I wanted to follow up on this thread in case there were any remaining thoughts on changing the default write granularity in Spark to be file-scoped.
>
> Given the degenerate amplification case for partition-scoped deletes and the additional benefits outlined earlier, I think it would be great if we could move towards this.
>
> I'd also like to point out that the latest version of my PR [1] changes the default to file-scoped deletes for all Spark 3.5 writes via SparkWriteConf rather than just new V2 tables. It's potentially a wider-impacting change, but considering the benefits mentioned earlier in the thread it seems worth it. In the event that any particular job has issues with this default change, the conf can be overridden.
>
> Please let me know what you think!
>
> [1] https://github.com/apache/iceberg/pull/11478
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Nov 26, 2024 at 3:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Just following up on this thread,
>>
>> Getting numbers for various table layouts is involved, and I think it would instead be helpful to look at a degenerate read amplification case arising from partition-scoped deletes. Furthermore, there are some additional benefits to file-scoped deletes, which benchmarking alone won't capture, that are worth clarifying.
>>
>> *Degenerate Read Amplification Case for Partition-Scoped Deletes*
>>
>> Consider a table with id and data columns, partitioned with bucket(10, id) and 10 data files per bucket. With partition-scoped deletes, if a delete affects only two files within a bucket, any read touching the 8 unaffected files in that partition will unnecessarily fetch the delete file. Generalized: say there are D data files per partition, W writes performed, and I impacted data files for deletes, where I << D:
>>
>> - Partition-scoped: O(D * W) delete file reads in the worst case for a given partition, because each write produces a partition-scoped delete file which would need to be read for each data file.
>> - File-scoped: O(I) delete file reads in the worst case for a given partition.
>>
>> The key point above is that read amplification with partition-scoped deletes can increase with every write that's performed, and this is further compounded in aggregate by how many partitions are impacted. The file-scoped deletes that need to be read scale independently of the number of writes performed, since they're targeted per data file and each delete is maintained on write to ensure there is no more than one delete file for a given data file.
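To make the scaling concrete, here is a rough back-of-the-envelope sketch with purely illustrative numbers (not benchmark results); it just evaluates the O(D * W) vs. O(I) worst cases described above:

    // Worst-case delete file reads for a single partition, following the
    // O(D * W) vs. O(I) argument. Numbers below are illustrative only.
    object DeleteReadAmplification extends App {
      val dataFilesPerPartition = 10 // D
      val writes = 20                // W: writes that each delete rows in this partition
      val impactedDataFiles = 2      // I: data files that actually have deleted rows, I << D

      // Partition-scoped: each write adds one partition-wide delete file, and every
      // data file in the partition may have to consult every one of them on read.
      val partitionScoped = dataFilesPerPartition * writes

      // File-scoped with sync maintenance on write: at most one delete file per
      // impacted data file, independent of how many writes occurred.
      val fileScoped = impactedDataFiles

      println(s"partition-scoped worst case: $partitionScoped delete file reads")
      println(s"file-scoped worst case:      $fileScoped delete file reads")
    }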
>> *Additional benefits to file-scoped deletes*:
>>
>> - Now that file-scoped deletes are maintained as part of writes, old deletes will be removed from storage. With partition-scoped deletes, the delete file cannot be removed even when its entries for a given data file are replaced, since it can also contain deletes for other data files.
>> - Moving towards file-scoped deletes will help avoid unnecessary conflicts with concurrent compactions and other writes.
>> - Other engines already write file-scoped deletes. Anton mentioned Flink earlier in the thread, which also ties into the first point above, since much of the goal of writing file-scoped deletes there was to avoid unnecessary conflicts with concurrent compactions. Trino <https://github.com/trinodb/trino/blob/332a658f3a292f79da203a131d2e25b2418d5279/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteWriter.java#L73> is another engine which writes file-scoped position deletes as of today.
>> - We know that V3 deletion vectors will be an improvement in the majority of cases when compared to existing position deletes. Considering this, and the fact that file-scoped deletes are closer to DVs as Anton mentioned, moving towards file-scoped deletes makes it easier to migrate from file-scoped deletes to DVs.
>>
>> It's worth clarifying that if there is some regression for a given workload, it's always possible to set the property back to partition-scoped.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Fri, Nov 15, 2024 at 11:54 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>
>>> Following up on this thread,
>>>
>>> > I don't think this is a bad idea from a theoretical perspective. Do we have any actual numbers to back up the change?
>>>
>>> There are no numbers yet; changing the default is largely driven by the fact that the previous downside of file-scoped deletes, many files on disk, is now mitigated by Spark sync maintenance.
>>>
>>> To get some more numbers, I think we'll first need to update some of our benchmarks to be more representative of typical MoR workloads. For example, DeleteFileIndexBenchmark currently tests tables with a small number of partitions, many data files, and very few delete files, so for that benchmark this leads to a huge difference in file counts between partition-scoped and file-scoped and doesn't show the benefits of moving to file-scoped deletes by default. I'll look into updating some of these benchmarks.
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Tue, Nov 12, 2024 at 12:36 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>
>>>> I support the idea of switching to file-scoped deletes for new tables. The absence of sync maintenance prevented us from doing that earlier. Given that Amogh recently merged that functionality into main, we should be fine. Flink connectors have been using file-scoped deletes for a while now to solve issues with concurrent compactions.
>>>>
>>>> File-scoped deletes are closer to DVs and even share some code, like assignment by path and sync maintenance in Spark. This migration will allow us to test many concepts that are needed for DVs prior to completing the V3 spec. It is also easier to migrate file-scoped deletes to DVs in V3. Unlike partition-scoped deletes, they can be discarded from the table state once a DV is produced for that data file.
>>>>
>>>> - Anton
>>>>
>>>> On Mon, Nov 11, 2024 at 21:01 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I don't think this is a bad idea from a theoretical perspective. Do we have any actual numbers to back up the change? I would think for most folks we would recommend just going to V3 rather than changing granularity for their new tables. It would just affect new tables though, so I'm not opposed to the change except on the grounds that we are changing a default.
>>>>> On Mon, Nov 11, 2024 at 1:56 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I wanted to discuss changing the default position delete file granularity for Spark from partition to file level for any newly created V2 tables. See this PR [1].
>>>>>>
>>>>>> Context on delete file granularity:
>>>>>>
>>>>>> - Partition granularity: Writers group delete files for multiple data files from the same partition into the same delete file. This leads to fewer files on disk, but higher read amplification, since a scan may read delete information for data files it doesn't touch.
>>>>>> - File granularity: Writers write a new delete file for every changed data file. Reads of delete information are more targeted, but this can lead to more files on disk.
>>>>>>
>>>>>> With the recent merge of synchronous position delete maintenance on write in Spark [2], file granularity as a default is more compelling since reads would be more targeted *and* files would be maintained on disk. I also recommend folks go through the deletion vector design doc for more details [3].
>>>>>>
>>>>>> Note that for existing tables with high delete-to-data file ratios, Iceberg's rewrite position deletes procedure can compact the table, and every subsequent write would then continuously maintain the position deletes. Additionally, note that in V3 at most one Puffin position delete file is allowed per data file; what's being discussed here is changing the default granularity for new V2 tables, since it should generally be better after the sync maintenance addition.
>>>>>>
>>>>>> What are folks' thoughts on this?
>>>>>>
>>>>>> [1] https://github.com/apache/iceberg/pull/11478
>>>>>> [2] https://github.com/apache/iceberg/pull/11273
>>>>>> [3] https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?tab=t.0#heading=h.193fl7s89tcg
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
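One practical note on the compaction step mentioned for existing tables with a high delete-to-data file ratio: as I understand it, that is the rewrite_position_delete_files Spark procedure, which can be run before (or after) switching granularity so subsequent writes start from a compacted state. A minimal sketch from spark-shell, with placeholder catalog/table names:

    // Compact existing position delete files for a table that has accumulated
    // many of them. Catalog and table names below are placeholders.
    spark.sql("""
      CALL my_catalog.system.rewrite_position_delete_files(table => 'db.events')
    """).show(truncate = false)

If I remember correctly, the procedure reports how many delete files were rewritten and added, which makes it easy to confirm the compaction took effect.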