I think that makes sense to do, I'll do a review this morning.
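For anyone who wants to try this on a specific table before the default changes, my understanding is that the granularity can already be controlled per table through the write.delete.granularity property that SparkWriteConf picks up. A minimal sketch from spark-shell, with placeholder catalog/table names and assuming a recent enough Iceberg release where the property is supported:

    // Opt a table into file-scoped position deletes; names below are placeholders.
    spark.sql("""
      ALTER TABLE my_catalog.db.events
      SET TBLPROPERTIES ('write.delete.granularity' = 'file')
    """)

    // And if a particular job regresses, set it back to the previous behavior:
    spark.sql("""
      ALTER TABLE my_catalog.db.events
      SET TBLPROPERTIES ('write.delete.granularity' = 'partition')
    """)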
On Mon, Dec 23, 2024 at 9:53 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey all, it's been a while but I wanted to follow up on this thread in case there were any remaining thoughts on changing the default write granularity in Spark to be file-scoped.
>
> Given the degenerate amplification case for partition-scoped deletes and the additional benefits outlined earlier, I think it would be great if we could move towards this.
>
> I'd also like to point out that the latest version of my PR [1] changes the default to file-scoped deletes for all Spark 3.5 writes via SparkWriteConf rather than just new V2 tables. It's potentially a wider-impacting change, but considering the benefits mentioned earlier in the thread it seems worth it. In the event that any particular job has issues with this default change, the conf can be overridden.
>
> Please let me know what you think!
>
> [1] https://github.com/apache/iceberg/pull/11478
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Nov 26, 2024 at 3:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Just following up on this thread,
>>
>> Getting numbers for various table layouts is involved, and I think it would instead be helpful to look at a degenerate read amplification case arising from partition-scoped deletes. Furthermore, there are some additional benefits to file-scoped deletes, which benchmarking alone won't capture, that are worth clarifying.
>>
>> *Degenerate Read Amplification Case for Partition-Scoped Deletes*
>>
>> Consider a table with id and data columns, partitioned with bucket(10, id) and 10 data files per bucket. With partition-scoped deletes, if a delete affects only two files within a bucket, any read touching the 8 unaffected files in that partition will unnecessarily fetch the delete file. Generalized: say there are D data files per partition, W writes performed, and I impacted data files for deletes, where I << D:
>>
>> - Partition-scoped: O(D * W) delete file reads in the worst case for a given partition, because each write produces a partition-scoped delete file which would need to be read for each data file.
>> - File-scoped: O(I) delete file reads in the worst case for a given partition.
>>
>> The key point above is that read amplification with partition-scoped deletes can increase with every write that's performed, and this is further compounded in aggregate by how many partitions are impacted. The file-scoped deletes that need to be read scale independently of the number of writes performed, since they're targeted per data file and each delete is maintained on write to ensure there is no more than one delete file for a given data file.
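To make the scaling concrete, here is a rough back-of-the-envelope sketch with purely illustrative numbers (not benchmark results); it just evaluates the O(D * W) vs. O(I) worst cases described above:

    // Worst-case delete file reads for a single partition, following the
    // O(D * W) vs. O(I) argument. Numbers below are illustrative only.
    object DeleteReadAmplification extends App {
      val dataFilesPerPartition = 10 // D
      val writes = 20                // W: writes that each delete rows in this partition
      val impactedDataFiles = 2      // I: data files that actually have deleted rows, I << D

      // Partition-scoped: each write adds one partition-wide delete file, and every
      // data file in the partition may have to consult every one of them on read.
      val partitionScoped = dataFilesPerPartition * writes

      // File-scoped with sync maintenance on write: at most one delete file per
      // impacted data file, independent of how many writes occurred.
      val fileScoped = impactedDataFiles

      println(s"partition-scoped worst case: $partitionScoped delete file reads")
      println(s"file-scoped worst case:      $fileScoped delete file reads")
    }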
>> *Additional benefits to file-scoped deletes*:
>>
>> - Now that file-scoped deletes are maintained as part of writes, old deletes will be removed from storage. With partition-scoped deletes, the delete file cannot be removed even when its entries for a given data file are replaced, since it can also contain deletes for other data files.
>> - Moving towards file-scoped deletes will help avoid unnecessary conflicts with concurrent compactions and other writes.
>> - Other engines already write file-scoped deletes. Anton mentioned Flink earlier in the thread, which also ties into the first point above, since much of the goal of writing file-scoped deletes there was to avoid unnecessary conflicts with concurrent compactions. Trino <https://github.com/trinodb/trino/blob/332a658f3a292f79da203a131d2e25b2418d5279/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteWriter.java#L73> is another engine which writes file-scoped position deletes as of today.
>> - We know that V3 deletion vectors will be an improvement in the majority of cases when compared to existing position deletes. Considering this, and the fact that file-scoped deletes are closer to DVs as Anton mentioned, moving towards file-scoped deletes makes it easier to migrate from file-scoped deletes to DVs.
>>
>> It's worth clarifying that if there is some regression for a given workload, it's always possible to set the property back to partition-scoped.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Fri, Nov 15, 2024 at 11:54 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>
>>> Following up on this thread,
>>>
>>> > I don't think this is a bad idea from a theoretical perspective. Do we have any actual numbers to back up the change?
>>>
>>> There are no numbers yet; changing the default is largely driven by the fact that the previous downside of file-scoped deletes, many files on disk, is now mitigated by Spark sync maintenance.
>>>
>>> To get some more numbers, I think we'll first need to update some of our benchmarks to be more representative of typical MoR workloads. For example, DeleteFileIndexBenchmark currently tests tables with a small number of partitions, many data files, and very few delete files, so for that benchmark this leads to a huge difference in file counts between partition-scoped and file-scoped and doesn't show the benefits of moving to file-scoped deletes by default. I'll look into updating some of these benchmarks.
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Tue, Nov 12, 2024 at 12:36 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>
>>>> I support the idea of switching to file-scoped deletes for new tables. The absence of sync maintenance prevented us from doing that earlier. Given that Amogh recently merged that functionality into main, we should be fine. Flink connectors have been using file-scoped deletes for a while now to solve issues with concurrent compactions.
>>>>
>>>> File-scoped deletes are closer to DVs and even share some code, like assignment by path and sync maintenance in Spark. This migration will allow us to test many concepts that are needed for DVs prior to completing the V3 spec. It is also easier to migrate file-scoped deletes to DVs in V3. Unlike partition-scoped deletes, they can be discarded from the table state once a DV is produced for that data file.
>>>>
>>>> - Anton
>>>>
>>>> On Mon, Nov 11, 2024 at 21:01 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I don't think this is a bad idea from a theoretical perspective. Do we have any actual numbers to back up the change? I would think for most folks we would recommend just going to V3 rather than changing granularity for their new tables. It would just affect new tables though, so I'm not opposed to the change except on the grounds that we are changing a default.
>>>>> On Mon, Nov 11, 2024 at 1:56 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I wanted to discuss changing the default position delete file granularity for Spark from partition to file level for any newly created V2 tables. See this PR [1].
>>>>>>
>>>>>> Context on delete file granularity:
>>>>>>
>>>>>> - Partition granularity: Writers group delete files for multiple data files from the same partition into the same delete file. This leads to fewer files on disk, but higher read amplification, since a scan may read delete information for data files it doesn't touch.
>>>>>> - File granularity: Writers write a new delete file for every changed data file. Reads of delete information are more targeted, but this can lead to more files on disk.
>>>>>>
>>>>>> With the recent merge of synchronous position delete maintenance on write in Spark [2], file granularity as a default is more compelling since reads would be more targeted *and* files would be maintained on disk. I also recommend folks go through the deletion vector design doc for more details [3].
>>>>>>
>>>>>> Note that for existing tables with high delete-to-data file ratios, Iceberg's rewrite position deletes procedure can compact the table, and every subsequent write would then continuously maintain the position deletes. Additionally, note that in V3 at most one Puffin position delete file is allowed per data file; what's being discussed here is changing the default granularity for new V2 tables, since it should generally be better after the sync maintenance addition.
>>>>>>
>>>>>> What are folks' thoughts on this?
>>>>>>
>>>>>> [1] https://github.com/apache/iceberg/pull/11478
>>>>>> [2] https://github.com/apache/iceberg/pull/11273
>>>>>> [3] https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?tab=t.0#heading=h.193fl7s89tcg
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
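One practical note on the compaction step mentioned for existing tables with a high delete-to-data file ratio: as I understand it, that is the rewrite_position_delete_files Spark procedure, which can be run before (or after) switching granularity so subsequent writes start from a compacted state. A minimal sketch from spark-shell, with placeholder catalog/table names:

    // Compact existing position delete files for a table that has accumulated
    // many of them. Catalog and table names below are placeholders.
    spark.sql("""
      CALL my_catalog.system.rewrite_position_delete_files(table => 'db.events')
    """).show(truncate = false)

If I remember correctly, the procedure reports how many delete files were rewritten and added, which makes it easy to confirm the compaction took effect.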