Re: [DISCUSS] v4 - One file commits

Fokko Driesprong Fri, 17 Oct 2025 21:06:15 -0700

Hey Amogh,

Thanks for the write-up. Unfortunately, I won’t be able to attend. Will it
be recorded? Thanks!


Kind regards,
Fokko

On Tue, Oct 7, 2025 at 8:36 PM Amogh Jahagirdar <[email protected]> wrote:

> Hey all,
>
> I've set up time this Friday at 9am PST for another sync on single file
> commits. In terms of what would be great to focus on for the discussion:
>
> 1. Whether it makes sense to eliminate the tuple and instead represent
> it via lower/upper bounds. As a reminder, one of
> the goals is to avoid tying a partition spec to a manifest; in the root we
> can have a mix of files spanning different partition specs, and even in
> leaf manifests avoiding this coupling can enable more desirable clustering
> of metadata.
> In the vast majority of cases, we could leverage the property that a file
> is effectively partitioned if the lower/upper for a given field is equal.
> The nuance here is with the particular case of identity partitioned
> string/binary columns which can be truncated in stats. One approach is to
> require that writers not produce truncated stats for identity
> partitioned columns. It's also important to keep in mind that all of this
> is just for the purpose of reconstructing the partition tuple, which is
> only required during equality delete matching. Another area we need to
> cover as part of this is exact bounds on stats. There are other options
> here as well, such as making all new equality deletes in V4 global and
> matching them based on bounds, or keeping the tuple but basing each tuple
> on a union schema of all partition specs. I am adding a separate appendix
> section outlining the span of options here and the different tradeoffs.
> Once we get this to a more conclusive state, I'll move a summarized
> version to the main doc.
>
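A minimal sketch of the lower/upper-bound idea in item 1, assuming a hypothetical `FieldStats` record (not an Iceberg API) that carries a flag for truncated string/binary bounds. A field contributes a value to the reconstructed tuple only when both bounds are present, equal, and not truncated. Null partition values are ignored in this sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical per-field column stats for one data file."""
    lower: Any
    upper: Any
    truncated: bool = False  # True if a string/binary bound was shortened


def partition_value(stats: FieldStats) -> Optional[Any]:
    """Return the value to use in a reconstructed tuple, or None if unknown."""
    if stats.lower is None or stats.upper is None:
        return None
    if stats.truncated:
        # Equal truncated bounds do not prove every row has the same value.
        return None
    return stats.lower if stats.lower == stats.upper else None


def reconstruct_tuple(file_stats: dict, partition_fields: list) -> Optional[tuple]:
    """Rebuild the partition tuple, or None if any field is not single-valued."""
    values = []
    for name in partition_fields:
        stats = file_stats.get(name)
        value = partition_value(stats) if stats is not None else None
        if value is None:
            return None
        values.append(value)
    return tuple(values)
```

With this shape, a file whose stats are `{"region": FieldStats("us", "us")}` yields the tuple `("us",)` for `["region"]`, while the same bounds marked `truncated=True` yield no tuple at all, which is the string/binary nuance described above.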
> 2. @[email protected] <[email protected]> has updated the doc with
> a section
> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn>
> on how we can do change detection from the root in a variety of write
> scenarios. I've done a review on it, and it covers the cases I would
> expect. It'd be good for folks to take a look and give feedback
> before we discuss. Thank you Steven for adding that section and all the
> diagrams.
>
> Thanks,
> Amogh Jahagirdar
>
> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]> wrote:
>
>> Hey folks, just following up from the discussion last Friday with a
>> summary and some next steps:
>>
>> 1.) For the various change detection cases, we concluded it's best just
>> to go through those in an offline manner on the doc since it's hard to
>> verify all that correctness in a large meeting setting.
>> 2.) We mostly discussed eliminating the partition tuple. In the original
>> proposal, I was mostly aiming for the ability to reconstruct the tuple
>> from the stats for the purpose of equality delete matching (a file is
>> partitioned if the lower and upper bounds are equal). There's some nuance
>> in how we need to handle identity partition values since for string/binary
>> they cannot be truncated. Another potential option is to treat all equality
>> deletes as effectively global and narrow their application based on the
>> stats values. This may require defining tight bounds. I'm still collecting
>> my thoughts on this one.
>>
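A rough sketch of the "global equality deletes narrowed by stats" option, assuming each data file exposes `(lower, upper)` bounds per field that are valid for every row in the file. The function name and dict shapes are hypothetical, and any other conditions on delete applicability are left out.

```python
def delete_may_apply(delete_row: dict, file_bounds: dict) -> bool:
    """Return False only when bounds prove the delete cannot match the file.

    delete_row maps equality field names to the value being deleted.
    file_bounds maps field names to (lower, upper) for one data file.
    """
    for field, value in delete_row.items():
        bounds = file_bounds.get(field)
        if bounds is None:
            continue  # no stats for this field, so it cannot be ruled out
        lower, upper = bounds
        if value < lower or value > upper:
            return False
    return True
```

A delete on `{"id": 50}` is skipped for a file with `id` bounds `(1, 10)`, but must still be checked against a file that has no stats for `id`.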
>> Thanks folks! Please also let me know if any of the following links are
>> inaccessible for any reason.
>>
>> Meeting recording link:
>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view
>> Meeting summary:
>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU
>>
>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]> wrote:
>>
>>> Update: I moved the discussion time to this Friday at 9 am PST since I
>>> found out that quite a few folks involved in the proposals will be out next
>>> week, and I know some folks will also be out the week after that.
>>>
>>> Thanks,
>>> Amogh J
>>>
>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> Hey folks, sorry for the late follow-up here,
>>>>
>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording
>>>> link of the previous discussion! I've set up another sync for next Tuesday
>>>> 09/16 at 9am PST. This time I've set it up from my corporate email so we
>>>> can get recordings and transcriptions (and I've made sure to keep the
>>>> meeting invite open so we don't have to manually let people in).
>>>>
>>>> In terms of next steps, these are the areas I think would be good to
>>>> focus on for establishing consensus:
>>>>
>>>> 1. How do we model the manifest entry structure so that changes to
>>>> manifest DVs can be obtained easily from the root? There are a few options
>>>> here; the most promising approach is to keep an additional DV which encodes
>>>> the diff in additional positions which have been removed from a leaf
>>>> manifest.
>>>>
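One way to picture the "additional DV" option in item 1, using plain Python sets as a stand-in for deletion vectors. A real DV would be a compressed bitmap, and the function and parameter names here are hypothetical.

```python
def apply_removals(current_dv: frozenset, newly_removed: set):
    """Return (updated_dv, diff_dv) for one leaf manifest.

    current_dv: positions already marked removed in the leaf manifest.
    newly_removed: positions this commit wants to remove.
    diff_dv holds only the positions that were not removed before.
    """
    diff_dv = frozenset(newly_removed) - current_dv
    updated_dv = current_dv | diff_dv
    return updated_dv, diff_dv
```

Under this assumption, a reader that only wants the changes made by a commit would consult `diff_dv` from the root instead of comparing two full DVs.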
>>>> 2. Modeling partition transforms via expressions and establishing a
>>>> unified table ID space so that we can simplify how partition tuples may be
>>>> represented via stats and also have a way in the future to store stats on
>>>> any derived column. I have a short proposal
>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0>
>>>> for this that probably still needs some tightening up on the expression
>>>> modeling itself (and some prototyping) but the general idea for
>>>> establishing a unified table ID space is covered. All feedback welcome!
>>>>
>>>> Thanks,
>>>>
>>>> Amogh Jahagirdar
>>>>
>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Amogh. Looks like the recording for last week's sync is
>>>>> available on YouTube. Here's the link:
>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> Just following up on this to give the community an update on where
>>>>>> we're at and my proposed next steps.
>>>>>>
>>>>>> I've been editing and merging the contents from our proposal into the
>>>>>> proposal
>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw>
>>>>>> from Russell and others. For any future comments on docs, please comment
>>>>>> on the linked proposal. I've also marked it on our doc in red text so
>>>>>> it's clear to redirect to the other proposal as a source of truth for
>>>>>> comments.
>>>>>>
>>>>>> In terms of next steps,
>>>>>>
>>>>>> 1. An important design decision point is around inline manifest DVs,
>>>>>> external manifest DVs or enabling both. I'm working on measuring
>>>>>> different approaches for the compressed DV representation since that
>>>>>> will inform how many entries can reasonably fit in a small root manifest;
>>>>>> from that we can derive implications on different write patterns and
>>>>>> determine the right approach for storing these manifest DVs.
>>>>>>
>>>>>> 2. Another key point is around determining if/how we can reasonably
>>>>>> enable V4 to represent changes in the root manifest so that readers can
>>>>>> effectively just infer file level changes from the root.
>>>>>>
>>>>>> 3. One of the aspects of the proposal is getting away from the partition
>>>>>> tuple requirement in the root, which currently forces an association
>>>>>> between a partition spec and a manifest. These aspects can be modeled
>>>>>> as essentially column stats, which gives a lot of flexibility in the
>>>>>> organization of the manifest. There are important details around field
>>>>>> ID spaces here which tie into how the stats are structured. What we're
>>>>>> proposing here is to have a unified expression ID space that could also
>>>>>> benefit us for storing things like virtual columns down the line. I go
>>>>>> into this in the proposal but I'm working on separating the appropriate
>>>>>> parts so that the original proposal can mostly just focus on the
>>>>>> organization of the content metadata tree and not how we want to solve
>>>>>> this particular ID space problem.
>>>>>>
>>>>>> 4. I'm planning on scheduling a recurring community sync starting
>>>>>> next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that
>>>>>> this time will never work, I can certainly adjust. For some reason, I
>>>>>> don't have the ability to add to the Iceberg Dev calendar, so I'll figure
>>>>>> that out and update the thread when the event is scheduled.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
>>>>>>
>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think this is a great way forward, starting out with this much
>>>>>>> parallel development shows that we have a lot of consensus already :)
>>>>>>>
>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey folks, just following up on this. It looks like our proposal
>>>>>>>> and the proposal that @Russell Spitzer <[email protected]>
>>>>>>>> shared are pretty aligned. I was just chatting with Russell about this,
>>>>>>>> and we think it'd be best to combine both proposals and have a singular
>>>>>>>> large effort on this. I can also set up a focused community discussion
>>>>>>>> (similar to what we're doing on the other V4 proposals) on this starting
>>>>>>>> sometime next week just to get things moving, if that works for people.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Russell,
>>>>>>>>>
>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop and
>>>>>>>>> I) have also been working on a proposal for an adaptive metadata tree
>>>>>>>>> structure as part of enabling more efficient one file commits. From a
>>>>>>>>> read of the summary, it's great to see that we're thinking along the
>>>>>>>>> same lines about how to tackle this fundamental area!
>>>>>>>>>
>>>>>>>>> Here is our proposal:
>>>>>>>>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hey y'all!
>>>>>>>>>>
>>>>>>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some of the
>>>>>>>>>> thoughts we had on how one-file commits could work in Iceberg. This
>>>>>>>>>> is pretty much just a high level overview of the concepts we think
>>>>>>>>>> we need and how Iceberg would behave. We haven't gone very far into
>>>>>>>>>> the actual implementation and changes that would need to occur in
>>>>>>>>>> the SDK to make this happen.
>>>>>>>>>>
>>>>>>>>>> The high level summary is:
>>>>>>>>>>
>>>>>>>>>> Manifest Lists are out
>>>>>>>>>> Root Manifests take their place
>>>>>>>>>>   A Root manifest can have data manifests, delete manifests,
>>>>>>>>>>   manifest delete vectors, data delete vectors and data files
>>>>>>>>>>   Manifest delete vectors allow for modifying a manifest without
>>>>>>>>>>   deleting it entirely
>>>>>>>>>>   Data files let you append without writing an intermediary
>>>>>>>>>>   manifest
>>>>>>>>>>   Having child data and delete manifests lets you still scale
>>>>>>>>>>
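The summary above can be pictured as a small data model. This is an illustrative sketch only: the class names, the `path` field, and the file names in the usage note are hypothetical and are not taken from the proposal doc or any Iceberg SDK.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EntryKind(Enum):
    """Kinds of entries the summary says a root manifest can hold."""
    DATA_MANIFEST = "data_manifest"
    DELETE_MANIFEST = "delete_manifest"
    MANIFEST_DV = "manifest_dv"
    DATA_DV = "data_dv"
    DATA_FILE = "data_file"


@dataclass
class RootEntry:
    kind: EntryKind
    path: str


@dataclass
class RootManifest:
    entries: List[RootEntry] = field(default_factory=list)

    def append_data_file(self, path: str) -> "RootManifest":
        """Small append: a new root that lists the data file directly,
        without writing an intermediary manifest."""
        return RootManifest(self.entries + [RootEntry(EntryKind.DATA_FILE, path)])
```

For example, calling `append_data_file("f1.parquet")` on a root that already references one data manifest returns a new root with two entries and leaves the old root unchanged, so the commit writes a single new metadata file.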
>>>>>>>>>> Please take a look if you like,
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>>>>>>
>>>>>>>>>> I'm excited to see what other proposals and ideas are floating
>>>>>>>>>> around the community,
>>>>>>>>>> Russ
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Very excited about the idea!
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm very interested in this initiative. Micah Kornfield and I
>>>>>>>>>>>> presented
>>>>>>>>>>>> <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on
>>>>>>>>>>>> high-throughput ingestion for Iceberg tables at the 2024 Iceberg
>>>>>>>>>>>> Summit, which leveraged Google infrastructure like Colossus for
>>>>>>>>>>>> efficient appends.
>>>>>>>>>>>>
>>>>>>>>>>>> This new proposal is particularly exciting because it offers
>>>>>>>>>>>> significant advancements in commit latency and metadata storage
>>>>>>>>>>>> footprint. Furthermore, a consistent manifest structure promises to
>>>>>>>>>>>> simplify the design and codebase, which is a major benefit.
>>>>>>>>>>>>
>>>>>>>>>>>> A related idea I've been exploring is having a loose affinity
>>>>>>>>>>>> between data and delete manifests. While the current separation of
>>>>>>>>>>>> data and delete manifests in Iceberg is valuable for avoiding data
>>>>>>>>>>>> file rewrites (and stats updates) when deletes change, it does
>>>>>>>>>>>> necessitate a join operation during reads. I'd be keen to discuss
>>>>>>>>>>>> approaches that could potentially reduce this read-side cost while
>>>>>>>>>>>> retaining the benefits of separate manifests.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Anoop
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am new to the Iceberg community but would love to
>>>>>>>>>>>>> participate in these discussions to reduce the number of file
>>>>>>>>>>>>> writes, especially for small writes/commits.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>> -Jagdeep
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have been hitting all the metadata problems you mentioned,
>>>>>>>>>>>>>> Ryan. I’m on-board to help however I can to improve this area.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am interested in this idea and looking forward to
>>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Huang-Hsiang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Namratha
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for kicking this thread off Ryan, I'm interested in
>>>>>>>>>>>>>>> helping out here! I've been working on a proposal in this area
>>>>>>>>>>>>>>> and it would be great to collaborate with different folks and
>>>>>>>>>>>>>>> exchange ideas here, since I think a lot of people are interested
>>>>>>>>>>>>>>> in solving this problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Like Russell’s recent note, I’m starting a thread to
>>>>>>>>>>>>>>>> connect those of us that are interested in the idea of
>>>>>>>>>>>>>>>> changing Iceberg’s metadata in v4 so that in most cases
>>>>>>>>>>>>>>>> committing a change only requires writing one additional
>>>>>>>>>>>>>>>> metadata file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at
>>>>>>>>>>>>>>>> least one manifest and a new manifest list to produce a new
>>>>>>>>>>>>>>>> snapshot. The goal of this work is to allow more flexibility by
>>>>>>>>>>>>>>>> allowing the manifest list layer to store data and delete
>>>>>>>>>>>>>>>> files. As a result, only one file write would be needed before
>>>>>>>>>>>>>>>> committing the new snapshot. In addition, this work will also
>>>>>>>>>>>>>>>> try to explore:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Avoiding small manifests that must be read in
>>>>>>>>>>>>>>>>    parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>    - Extending metadata skipping to use aggregated column
>>>>>>>>>>>>>>>>    ranges that are compatible with geospatial data (manifest
>>>>>>>>>>>>>>>>    metadata)
>>>>>>>>>>>>>>>>    - Using soft deletes to avoid rewriting existing
>>>>>>>>>>>>>>>>    manifests (metadata DVs)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you’re interested in these problems, please reply!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> John Zhuge
>>>>>>>>>>>
>>>>>>>>>>
