I agree that a large list of snapshots is where we see the most
metadata.json bloat, and this is something that we should address.
Also, my understanding is that both the recommendations discussed here
(optional persistence of metadata.json and changing the structure of
metadata.json) would only improve the commit/write performance without
addressing readers, who still have to read the whole metadata.json (or
the portion for a given branch) deserialized as TableMetadata. I believe
that we should explore solutions that also address the large metadata
size in the loadTable API response, since many services have an upper
limit on the response size (e.g., AWS API Gateway has a 10MB limit, ref:
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-execution-service-limits-table.html).
I'm also biased towards a solution that pages out the old snapshots and
uses a referencing mechanism in the metadata.json, but in addition ports
the same mechanism to TableMetadata and LoadTableResponse so it's not
merely a storage/write optimization.
+1 for a detailed proposal where we make intentional tradeoffs that cost
performance in either less frequently used paths (time travel) or async
paths (snapshot expiration), while making sure that hot paths (commit
and recent reads) gain performance.

On Fri, Feb 13, 2026 at 4:58 PM Shawn Chang <[email protected]> wrote:

> +1 on Anton’s idea of archiving older snapshot states.
>
> Since these older states are rarely accessed, archiving them separately
> and possibly compressing them seems like a practical solution.
>
> That said, this calls for a detailed proposal so we don't shoot ourselves
> in the foot. Some details can be very critical, such as what kind of knobs
> we provide to users so they can control the archival window and strategies.
>
> Best,
> Shawn
>
> On Thu, Feb 12, 2026 at 1:02 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> Given that there were past discussions or ideas on how to make writing of
>> the root file optional, is anyone willing to create a proposal?
>>
>> Yufei, I agree that portability is critical and non-negotiable. That
>> said, keeping the root file in the spec and requiring catalogs to produce
>> one if needed seems to address that?
>>
>> Fokko, Eduard, Alex, if we decide to offload some snapshot state, I think
>> it would apply only to historic and unused / rarely used snapshots. The
>> goal is not to rewrite the entire snapshot log on each commit as well as
>> not to introduce more IO for most common operations. Not all catalogs will
>> be able to avoid generating the root metadata file, therefore exploring
>> this path still seems beneficial? What do you think?
>>
>> On Thu, Feb 12, 2026 at 7:23 AM Alex Stephen via dev <[email protected]>
>> wrote:
>>
>>> Agree with the thoughts about creating separate files. Another thing to
>>> note is that many cloud storage providers charge for operations, where the
>>> numbers of operations scales with the number of files.
>>>
>>> This only really matters at great scale (and egress costs will always
>>> dwarf operation costs), but having multiple metadata files will lead to an
>>> increase in operations.
>>>
>>> On Thu, Feb 12, 2026 at 7:17 AM Eduard Tudenhöfner <
>>> [email protected]> wrote:
>>>
>>>> I have the same concerns as Fokko w.r.t offloading the snapshot history
>>>> into a separate file. I think it makes sense to focus on making the
>>>> metadata optional but have a way of generating it on demand when needed.
>>>>
>>>> On Thu, Feb 12, 2026 at 2:28 PM Fokko Driesprong <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I agree that making the metadata optional would solve a lot of issues
>>>>> that we see today. I do think it makes sense to be able to request the
>>>>> metadata json if needed.
>>>>>
>>>>> For me, restructuring the snapshot history sounds more problematic. We
>>>>> would have the {Hive/Glue/Sql}Catalog write two files, of which the
>>>>> unbounded one could take some time to write. Wouldn't we be shifting the
>>>>> problem while introducing a lot of complexity at the same time?
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>> On Wed, Feb 11, 2026 at 10:53 PM Yufei Gu <[email protected]> wrote:
>>>>>
>>>>>> I think Gang has a good point: we are discussing two things here.
>>>>>> First, an optional metadata.json file; second, restructuring the
>>>>>> metadata.json. The first one is more controversial as it impacts
>>>>>> portability. I'd suggest starting with the second one: we could try to
>>>>>> move unbounded metadata (e.g., snapshot history) out of the
>>>>>> metadata.json file.
>>>>>> Yufei
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 11, 2026 at 8:38 AM Anton Okolnychyi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> It looks like there is enough interest in the community and a few
>>>>>>> good ways to make the generation of the root metadata optional with some
>>>>>>> smarter catalogs. This would get us to truly single file commits in V4
>>>>>>> without sacrificing portability if catalogs are required to generate the
>>>>>>> root metadata file on demand.
>>>>>>>
>>>>>>> What should we do with built-in catalogs? Streaming appends should
>>>>>>> be supported out of the box without requiring aggressive table 
>>>>>>> maintenance.
>>>>>>> It seems that not writing the root metadata for HMS is going to be a 
>>>>>>> lot of
>>>>>>> work. Any thoughts? Should we then pursue offloading the snapshot 
>>>>>>> history
>>>>>>> for such catalogs?
>>>>>>>
>>>>>>> On Tue, Feb 10, 2026 at 6:53 PM Anoop Johnson <[email protected]> wrote:
>>>>>>>
>>>>>>>> Agree that snapshot history is the main bloat factor. We've seen
>>>>>>>> fast moving tables where writing the metadata.json file takes several
>>>>>>>> seconds. For comparison, Delta Lake uses an efficient binary-search 
>>>>>>>> based
>>>>>>>> time travel that can scale to O(millions) of table versions.
>>>>>>>>
>>>>>>>> Rather than limiting snapshot retention, we might want to consider
>>>>>>>> adding time travel directly to the IRC spec. The catalog could 
>>>>>>>> implement
>>>>>>>> scalable time travel using appropriate indexing. So GetTable API could
>>>>>>>> accept an optional `AS OF` timestamp param and return the table 
>>>>>>>> metadata as
>>>>>>>> of that timestamp. This would enable catalog implementations to choose
>>>>>>>> their own time travel strategy (indexes, bloom filters, etc.) Catalogs 
>>>>>>>> that
>>>>>>>> don't support time travel could return an error and clients fall back 
>>>>>>>> to
>>>>>>>> current behavior.
>>>>>>>>
>>>>>>>> I also like Prashant's solution to the portability concern by
>>>>>>>> having catalogs materialize the metadata.json on-demand through an 
>>>>>>>> export
>>>>>>>> API when needed for migration scenarios.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 6:27 PM Gang Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It seems that we are discussing on two orthogonal approaches:
>>>>>>>>>
>>>>>>>>> 1. Making the writing of the complete metadata.json file optional
>>>>>>>>> during a commit, especially for catalogs that can manage metadata
>>>>>>>>> themselves.
>>>>>>>>> 2. Restructuring the metadata.json file (e.g., by offloading growing
>>>>>>>>> parts like snapshot history to external files) to limit its size and
>>>>>>>>> reduce write I/O, while still requiring the root file on every commit
>>>>>>>>> for portability.
>>>>>>>>>
>>>>>>>>> I believe both approaches are worth exploring because in some cases
>>>>>>>>> portability is still a top priority.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>> On Wed, Feb 11, 2026 at 9:27 AM Manu Zhang <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Can we add an abstraction to spec like root metadata (or
>>>>>>>>> snapshot history manager) with the default implementation being
>>>>>>>>> metadata.json?
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Wed, Feb 11, 2026 at 9:07 AM Prashant Singh <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> +1. I think snapshot summary bloat was a major factor in
>>>>>>>>> >> metadata.json bloat too, especially for streaming writers, based
>>>>>>>>> >> on my past experience. Since we didn't want to propose a spec
>>>>>>>>> >> change, one other approach was to have a strict limit on how many
>>>>>>>>> >> snapshots we keep and let remove-orphans do the cleanup. We also
>>>>>>>>> >> removed the snapshot summaries, since they are optional anyway,
>>>>>>>>> >> and in streaming mode we create a large number of snapshots (not
>>>>>>>>> >> all of which were required).
>>>>>>>>> >> I believe there have been a lot of interesting discussions on
>>>>>>>>> >> optimizing reads [1] as well as writes [2]. If we are open to
>>>>>>>>> >> relaxing the spec a bit, it would be nice to move the tracking of
>>>>>>>>> >> the metadata to the catalog, with a protocol to retrieve it back
>>>>>>>>> >> without compromising portability. Maybe we can have a dedicated
>>>>>>>>> >> API that exports this to a file; in an intermediate stage we would
>>>>>>>>> >> operate on what is stored in the catalog and only materialize the
>>>>>>>>> >> file when and if asked. We are having a similar discussion in IRC.
>>>>>>>>> >>
>>>>>>>>> >> I think we all acknowledge it being a real problem for streaming
>>>>>>>>> >> writers :)!
>>>>>>>>> >>
>>>>>>>>> >> Past discussions :
>>>>>>>>> >> [1]
>>>>>>>>> https://lists.apache.org/thread/pwdd7qmdsfcrzjtsll53d3m9f74d03l8
>>>>>>>>> >> [2] https://github.com/apache/iceberg/issues/2723
>>>>>>>>> >>
>>>>>>>>> >> Best,
>>>>>>>>> >> Prashant Singh
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Feb 10, 2026 at 4:45 PM Anton Okolnychyi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> I think Yufei is right and the snapshot history is the main
>>>>>>>>> contributor. Streaming jobs that write every minute would generate 
>>>>>>>>> over 10K
>>>>>>>>> of snapshot entries per week. We had a similar problem with the list 
>>>>>>>>> of
>>>>>>>>> manifests that kept growing (until we added manifest lists) and with
>>>>>>>>> references to previous metadata files (we only keep the last 100 
>>>>>>>>> now). So
>>>>>>>>> we can definitely come up with something for snapshot entries. We 
>>>>>>>>> will have
>>>>>>>>> to ensure the entire set of snapshots is reachable from the latest 
>>>>>>>>> root
>>>>>>>>> file, even if it requires multiple IO operations.
>>>>>>>>> >>>
>>>>>>>>> >>> The main question is whether we still want to require writing
>>>>>>>>> root JSON files during commits. If so, our commits will never be 
>>>>>>>>> single
>>>>>>>>> file commits. In V4, we will have to write the root manifest as well 
>>>>>>>>> as the
>>>>>>>>> root metadata file. I would prefer the second to be optional but we 
>>>>>>>>> will
>>>>>>>>> need to think about static tables and how to incorporate that in the 
>>>>>>>>> spec.
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>> On Tue, Feb 10, 2026 at 3:58 PM Yufei Gu <[email protected]>
>>>>>>>>> >>> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> AFAIK, the snapshot history is the main, if not the only,
>>>>>>>>> >>>> reason for the large metadata.json file. Moving the extra
>>>>>>>>> >>>> snapshot history to an additional file and keeping it referenced
>>>>>>>>> >>>> in the root one may just resolve the issue.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Yufei
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>> On Tue, Feb 10, 2026 at 3:27 PM huaxin gao <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> +1, I think this is a real problem, especially for streaming
>>>>>>>>> / frequent appends where commit latency matters and metadata.json 
>>>>>>>>> keeps
>>>>>>>>> getting bigger.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I also agree we probably shouldn’t remove the root metadata
>>>>>>>>> file completely. Having one file that describes the whole table is 
>>>>>>>>> really
>>>>>>>>> useful for portability and debugging.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Of the options you listed, I like “offload pieces to
>>>>>>>>> external files” as a first step. We still write the root file every 
>>>>>>>>> commit,
>>>>>>>>> but it won’t grow as fast. The downside is extra maintenance/GC 
>>>>>>>>> complexity.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> A couple questions/ideas:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Do we have any data on what parts of metadata.json grow the
>>>>>>>>> most (snapshots / history / refs)? Even a rough breakdown could help 
>>>>>>>>> decide
>>>>>>>>> what to move out first.
>>>>>>>>> >>>>> Could we do a hybrid: still write the root file every
>>>>>>>>> commit, but only keep a “recent window” in it, and move older history 
>>>>>>>>> to
>>>>>>>>> referenced files? (portable, but bounded growth)
>>>>>>>>> >>>>> For “optional on commit”, maybe make it a catalog capability
>>>>>>>>> (fast commits if the catalog can serve metadata), but still support an
>>>>>>>>> export/materialize step when portability is needed.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Thanks,
>>>>>>>>> >>>>> Huaxin
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On Tue, Feb 10, 2026 at 2:58 PM Anton Okolnychyi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> I don't think we have any consensus or concrete plan. In
>>>>>>>>> fact, I don't know what my personal preference is at this point. The
>>>>>>>>> intention of this thread is to gain that clarity. I don't think 
>>>>>>>>> removing
>>>>>>>>> the root metadata file entirely is a good idea. It is great to have a 
>>>>>>>>> way
>>>>>>>>> to describe the entire state of a table in a file. We just need to 
>>>>>>>>> find a
>>>>>>>>> solution for streaming appends that suffer from the increasing size 
>>>>>>>>> of the
>>>>>>>>> root metadata file.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Like I said, making the generation of the json file on
>>>>>>>>> commit optional is one way to solve this problem. We can also think 
>>>>>>>>> about
>>>>>>>>> offloading pieces of it to external files (say old snapshots). This 
>>>>>>>>> would
>>>>>>>>> mean we still have to write the root file on each commit but it will 
>>>>>>>>> be
>>>>>>>>> smaller. One clear downside is more complicated maintenance.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Any other ideas/thoughts/feedback? Do people see this as a
>>>>>>>>> problem?
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> On Tue, Feb 10, 2026 at 2:18 PM Yufei Gu <[email protected]>
>>>>>>>>> >>>>>> wrote:
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Hi Anton, thanks for raising this. I would really like to
>>>>>>>>> make this optional and then build additional use cases on top of it. 
>>>>>>>>> For
>>>>>>>>> example, a catalog like IRC could completely eliminate storage IO 
>>>>>>>>> during
>>>>>>>>> commit and load, which is a big win. It could also provide better
>>>>>>>>> protection for encrypted Iceberg tables, since metadata.json files are
>>>>>>>>> plain text today.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> That said, do we have consensus that metadata.json can be
>>>>>>>>> optional? There are real portability concerns, and engine-side work 
>>>>>>>>> also
>>>>>>>>> needs consideration. For example, static tables and the Spark driver 
>>>>>>>>> still
>>>>>>>>> expect to read this file directly from storage. It feels like the 
>>>>>>>>> first
>>>>>>>>> step here is aligning on whether metadata.json can be optional at all,
>>>>>>>>> before we go deeper into how we get rid of it. What do you think?
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Yufei
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> On Tue, Feb 10, 2026 at 11:23 AM Anton Okolnychyi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> While it may be common knowledge among Iceberg devs that
>>>>>>>>> writing the root JSON file on commit is somewhat optional with the right
>>>>>>>>> catalog, what can we do in V4 to solve this problem for all? My 
>>>>>>>>> problem is
>>>>>>>>> the suboptimal behavior that new users get by default with HMS or 
>>>>>>>>> Hadoop
>>>>>>>>> catalogs and how this impacts their perception of Iceberg. We are 
>>>>>>>>> doing a
>>>>>>>>> bunch of work for streaming (e.g. changelog scans, single file 
>>>>>>>>> commits,
>>>>>>>>> etc), but the need to write the root JSON file may cancel all of that.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Let me throw some ideas out there.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> - Describe how catalogs can make the generation of the
>>>>>>>>> root metadata file optional in the spec. Ideally, implement that in a
>>>>>>>>> built-in catalog of choice as a reference implementation.
>>>>>>>>> >>>>>>>> - Offload portions of the root metadata file to external
>>>>>>>>> files and keep references to them.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Thoughts?
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> - Anton
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>>
>>>>>>>>
