Re: [DISCUSS] v4 - One file commits

Amogh Jahagirdar Sat, 18 Oct 2025 05:26:46 -0700

Hey Fokko,

Yes! It will be recorded and transcribed.


On Fri, Oct 10, 2025 at 9:38 AM Fokko Driesprong <[email protected]> wrote:

> Hey Amogh,
>
> Thanks for the write-up. Unfortunately, I won’t be able to attend. Will it
> be recorded? Thanks!
>
> Kind regards,
> Fokko
>
> On Tue, Oct 7, 2025 at 8:36 PM Amogh Jahagirdar <[email protected]> wrote:
>
>> Hey all,
>>
>> I've set up time this Friday at 9am PST for another sync on single file
>> commits. In terms of what would be great to focus on for the discussion:
>>
>> 1. Whether it makes sense to eliminate the tuple and instead
>> represent it via lower/upper bounds. As a reminder, one of
>> the goals is to avoid tying a partition spec to a manifest; in the root we
>> can have a mix of files spanning different partition specs, and even in
>> leaf manifests avoiding this coupling can enable more desirable clustering
>> of metadata.
>> In the vast majority of cases, we could leverage the property that a file
>> is effectively partitioned if the lower/upper for a given field is equal.
>> The nuance here is with the particular case of identity partitioned
>> string/binary columns which can be truncated in stats. One approach is to
>> require that writers must not produce truncated stats for identity
>> partitioned columns. It's also important to keep in mind that all of this
>> is just for the purpose of reconstructing the partition tuple, which is
>> only required during equality delete matching. Another area we need to
>> cover as part of this is exact bounds on stats. There are other options
>> here as well such as making all new equality deletes in V4 be global and
>> instead match based on bounds, or keeping the tuple but each tuple is
>> effectively based off a union schema of all partition specs. I am adding a
>> separate appendix section outlining the span of options here and the
>> different tradeoffs.
>> Once we get this more to a conclusive state, I'll move a summarized
>> version to the main doc.
>>
>> 2. @[email protected] <[email protected]> has updated the doc with
>> a section
>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn>
>> on how we can do change detection from the root in a variety of write
>> scenarios. I've reviewed it, and it covers the cases I would
>> expect. It'd be good for folks to take a look and give feedback
>> before we discuss. Thank you Steven for adding that section and all the
>> diagrams.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]>
>> wrote:
>>
>>> Hey folks, just following up from the discussion last Friday with a
>>> summary and some next steps:
>>>
>>> 1.) For the various change detection cases, we concluded it's best just
>>> to go through those in an offline manner on the doc since it's hard to
>>> verify all that correctness in a large meeting setting.
>>> 2.) We mostly discussed eliminating the partition tuple. In the original
>>> proposal, I was mostly aiming for the ability to reconstruct the tuple
>>> from the stats for the purpose of equality delete matching (a file is
>>> partitioned if the lower and upper bounds are equal). There's some nuance
>>> in how we need to handle identity partition values since for string/binary
>>> they cannot be truncated. Another potential option is to treat all equality
>>> deletes as effectively global and narrow their application based on the
>>> stats values. This may require defining tight bounds. I'm still collecting
>>> my thoughts on this one.
>>>
>>> Thanks folks! Please also let me know if any of the following links are
>>> inaccessible for any reason.
>>>
>>> Meeting recording link:
>>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view
>>> Meeting summary:
>>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU
>>>
>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> Update: I moved the discussion time to this Friday at 9 am PST since I
>>>> found out that quite a few folks involved in the proposals will be out next
>>>> week, and I know some folks will also be out the week after that.
>>>>
>>>> Thanks,
>>>> Amogh J
>>>>
>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey folks, sorry for the late follow-up here,
>>>>>
>>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording
>>>>> link of the previous discussion! I've set up another sync for next Tuesday
>>>>> 09/16 at 9am PST. This time I've set it up from my corporate email so we
>>>>> can get recordings and transcriptions (and I've made sure to keep the
>>>>> meeting invite open so we don't have to manually let people in).
>>>>>
>>>>> In terms of next steps of areas which I think would be good to focus
>>>>> on for establishing consensus:
>>>>>
>>>>> 1. How do we model the manifest entry structure so that changes to
>>>>> manifest DVs can be obtained easily from the root? There are a few options
>>>>> here; the most promising approach is to keep an additional DV which
>>>>> encodes the diff in additional positions which have been removed from a
>>>>> leaf manifest.
>>>>>
>>>>> 2. Modeling partition transforms via expressions and establishing a
>>>>> unified table ID space so that we can simplify how partition tuples may be
>>>>> represented via stats and also have a way in the future to store stats on
>>>>> any derived column. I have a short proposal
>>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0>
>>>>> for this that probably still needs some tightening up on the expression
>>>>> modeling itself (and some prototyping) but the general idea for
>>>>> establishing a unified table ID space is covered. All feedback welcome!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Amogh. Looks like the recording for last week's sync is
>>>>>> available on Youtube. Here's the link,
>>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> Just following up on this to give the community an update on where
>>>>>>> we're at and my proposed next steps.
>>>>>>>
>>>>>>> I've been editing and merging the contents from our proposal into
>>>>>>> the proposal
>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw>
>>>>>>> from Russell and others. For any future comments on docs, please comment
>>>>>>> on the linked proposal. I've also marked it on our doc in red text so it's
>>>>>>> clear to redirect to the other proposal as a source of truth for comments.
>>>>>>>
>>>>>>> In terms of next steps,
>>>>>>>
>>>>>>> 1. An important design decision point is around inline manifest DVs,
>>>>>>> external manifest DVs or enabling both. I'm working on measuring different
>>>>>>> approaches for the compressed DV representation since that
>>>>>>> will inform how many entries can reasonably fit in a small root manifest;
>>>>>>> from that we can derive implications on different write patterns and
>>>>>>> determine the right approach for storing these manifest DVs.
>>>>>>>
>>>>>>> 2. Another key point is around determining if/how we can reasonably
>>>>>>> enable V4 to represent changes in the root manifest so that readers can
>>>>>>> effectively just infer file level changes from the root.
>>>>>>>
>>>>>>> 3. One of the aspects of the proposal is getting away from the partition
>>>>>>> tuple requirement in the root, which currently forces an association
>>>>>>> between a partition spec and a manifest. These aspects can be
>>>>>>> modeled as essentially column stats, which gives a lot of flexibility in
>>>>>>> the organization of the manifest. There are important details around field
>>>>>>> ID spaces here which tie into how the stats are structured. What we're
>>>>>>> proposing here is to have a unified expression ID space that could also
>>>>>>> benefit us for storing things like virtual columns down the line. I go into
>>>>>>> this in the proposal but I'm working on separating the appropriate parts so
>>>>>>> that the original proposal can mostly just focus on the organization of the
>>>>>>> content metadata tree and not how we want to solve this particular ID space
>>>>>>> problem.
>>>>>>>
>>>>>>> 4. I'm planning on scheduling a recurring community sync starting
>>>>>>> next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that
>>>>>>> this time will never work, I can certainly adjust. For some reason, I don't
>>>>>>> have the ability to add to the Iceberg Dev calendar, so I'll figure that
>>>>>>> out and update the thread when the event is scheduled.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Amogh Jahagirdar
>>>>>>>
>>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think this is a great way forward, starting out with this much
>>>>>>>> parallel development shows that we have a lot of consensus already :)
>>>>>>>>
>>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey folks, just following up on this. It looks like our proposal
>>>>>>>>> and the proposal that @Russell Spitzer <[email protected]> shared
>>>>>>>>> are pretty aligned. I was just chatting with Russell about this, and we
>>>>>>>>> think it'd be best to combine both proposals and have a singular large
>>>>>>>>> effort on this. I can also set up a focused community discussion (similar
>>>>>>>>> to what we're doing on the other V4 proposals) on this starting sometime
>>>>>>>>> next week just to get things moving, if that works for people.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Russell,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop
>>>>>>>>>> and I) have also been working on a proposal for an adaptive metadata tree
>>>>>>>>>> structure as part of enabling more efficient one file commits. From a read
>>>>>>>>>> of the summary, it's great to see that we're thinking along the same lines
>>>>>>>>>> about how to tackle this fundamental area!
>>>>>>>>>>
>>>>>>>>>> Here is our proposal:
>>>>>>>>>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey y'all!
>>>>>>>>>>>
>>>>>>>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some
>>>>>>>>>>> of the thoughts we had on how one-file commits could work in
>>>>>>>>>>> Iceberg. This is pretty much just a high level overview of the
>>>>>>>>>>> concepts we think we need and how Iceberg would behave.
>>>>>>>>>>> We haven't gone very far into the actual implementation and
>>>>>>>>>>> changes that would need to occur in the SDK to make this happen.
>>>>>>>>>>>
>>>>>>>>>>> The high level summary is:
>>>>>>>>>>>
>>>>>>>>>>> Manifest Lists are out
>>>>>>>>>>> Root Manifests take their place
>>>>>>>>>>>   A Root manifest can have data manifests, delete manifests,
>>>>>>>>>>> manifest delete vectors, data delete vectors and data files
>>>>>>>>>>>   Manifest delete vectors allow for modifying a manifest without
>>>>>>>>>>> deleting it entirely
>>>>>>>>>>>   Data files let you append without writing an intermediary
>>>>>>>>>>> manifest
>>>>>>>>>>>   Having child data and delete manifests lets you still scale
>>>>>>>>>>>
>>>>>>>>>>> Please take a look if you like,
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>>>>>>>
>>>>>>>>>>> I'm excited to see what other proposals and ideas are floating
>>>>>>>>>>> around the community,
>>>>>>>>>>> Russ
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Very excited about the idea!
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm very interested in this initiative. Micah Kornfield and I
>>>>>>>>>>>>> presented
>>>>>>>>>>>>> <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on
>>>>>>>>>>>>> high-throughput ingestion for Iceberg tables at the 2024 Iceberg Summit,
>>>>>>>>>>>>> which leveraged Google infrastructure like Colossus for efficient appends.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This new proposal is particularly exciting because it offers
>>>>>>>>>>>>> significant advancements in commit latency and metadata storage footprint.
>>>>>>>>>>>>> Furthermore, a consistent manifest structure promises to simplify the
>>>>>>>>>>>>> design and codebase, which is a major benefit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A related idea I've been exploring is having a loose affinity
>>>>>>>>>>>>> between data and delete manifests. While the current separation of data and
>>>>>>>>>>>>> delete manifests in Iceberg is valuable for avoiding data file rewrites
>>>>>>>>>>>>> (and stats updates) when deletes change, it does necessitate a join
>>>>>>>>>>>>> operation during reads. I'd be keen to discuss approaches that could
>>>>>>>>>>>>> potentially reduce this read-side cost while retaining the benefits of
>>>>>>>>>>>>> separate manifests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Anoop
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am new to the Iceberg community but would love to
>>>>>>>>>>>>>> participate in these discussions to reduce the number of file writes,
>>>>>>>>>>>>>> especially for small writes/commits.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>> -Jagdeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have been hitting all the metadata problems you
>>>>>>>>>>>>>>> mentioned, Ryan. I’m on-board to help however I can to improve this area.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am interested in this idea and looking forward to
>>>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Huang-Hsiang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Namratha
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for kicking this thread off Ryan, I'm interested in
>>>>>>>>>>>>>>>> helping out here! I've been working on a proposal in this area and it would
>>>>>>>>>>>>>>>> be great to collaborate with different folks and exchange ideas here, since
>>>>>>>>>>>>>>>> I think a lot of people are interested in solving this problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Like Russell’s recent note, I’m starting a thread to
>>>>>>>>>>>>>>>>> connect those of us that are interested in the idea of changing Iceberg’s
>>>>>>>>>>>>>>>>> metadata in v4 so that in most cases committing a change only requires
>>>>>>>>>>>>>>>>> writing one additional metadata file.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at
>>>>>>>>>>>>>>>>> least one manifest and a new manifest list to produce a new snapshot. The
>>>>>>>>>>>>>>>>> goal of this work is to allow more flexibility by allowing the manifest
>>>>>>>>>>>>>>>>> list layer to store data and delete files. As a result, only one file write
>>>>>>>>>>>>>>>>> would be needed before committing the new snapshot. In addition, this work
>>>>>>>>>>>>>>>>> will also try to explore:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Avoiding small manifests that must be read in
>>>>>>>>>>>>>>>>>    parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>>    - Extending metadata skipping to use aggregated column
>>>>>>>>>>>>>>>>>    ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>>    - Using soft deletes to avoid rewriting existing
>>>>>>>>>>>>>>>>>    manifests (metadata DVs)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you’re interested in these problems, please reply!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>
>>>>>>>>>>>
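
One idea discussed in the thread is that a file is effectively partitioned on a field when its lower and upper bounds for that field are equal, with the caveat that truncated string/binary stats cannot be trusted for this. A minimal Python sketch of that check follows. It is only an illustration under stated assumptions: the `FieldBounds` type, its `exact` flag, and both function names are hypothetical and do not come from Iceberg or from the proposal docs, and null partition values are not handled.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FieldBounds:
    """Hypothetical per-field column stats for one data file."""
    lower: Any
    upper: Any
    # Assumption: False means the bounds may have been truncated
    # (for example long string/binary values), so equality of the
    # bounds does not prove that the file holds a single value.
    exact: bool = True


def partition_value(bounds: Optional[FieldBounds]) -> Optional[Any]:
    """Return the single value of the field, or None if it cannot be proven."""
    if bounds is None or not bounds.exact:
        return None
    if bounds.lower is None or bounds.upper is None:
        return None
    return bounds.lower if bounds.lower == bounds.upper else None


def reconstruct_tuple(
    stats: Dict[int, FieldBounds], partition_field_ids: Sequence[int]
) -> Optional[Tuple[Any, ...]]:
    """Rebuild a partition tuple from stats, or None if any field is not provable."""
    values = []
    for field_id in partition_field_ids:
        value = partition_value(stats.get(field_id))
        if value is None:
            return None
        values.append(value)
    return tuple(values)


# Example stats keyed by hypothetical field IDs.
stats = {
    1: FieldBounds("2025-10-01", "2025-10-01"),
    2: FieldBounds("us", "us"),
    3: FieldBounds("abc", "abc", exact=False),  # possibly truncated
    4: FieldBounds(10, 20),                     # a real range, not one value
}
```

With these example stats, `reconstruct_tuple(stats, [1, 2])` yields a tuple, while any request that includes field 3 or 4 yields `None`, which is the situation the thread proposes to avoid by requiring writers not to truncate stats for identity partitioned columns.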
