I have the same concerns as Fokko w.r.t. offloading the snapshot history into a separate file. I think it makes sense to focus on making the metadata file optional, but to have a way of generating it on demand when needed.
On Thu, Feb 12, 2026 at 2:28 PM Fokko Driesprong <[email protected]> wrote:
> Hey all,
>
> I agree that making the metadata optional would solve a lot of the issues that we see today. I do think it makes sense to be able to request the metadata JSON if needed.
>
> For me, restructuring the snapshot history sounds more problematic. The {Hive/Glue/Sql}Catalog would have to write two files, of which the unbounded one could take some time to write. Wouldn't we be shifting the problem while introducing a lot of complexity at the same time?
>
> Kind regards,
> Fokko
>
> On Wed, Feb 11, 2026 at 10:53 PM Yufei Gu <[email protected]> wrote:
>> I think Gang has a good point: we are discussing two things here: 1) making the metadata.json file optional, and 2) restructuring the metadata.json. The first one is more controversial as it impacts portability. I'd suggest starting with the second one: we could try to move unbounded metadata (e.g., snapshot history) out of the metadata.json file.
>>
>> Yufei
>>
>> On Wed, Feb 11, 2026 at 8:38 AM Anton Okolnychyi <[email protected]> wrote:
>>> It looks like there is enough interest in the community and a few good ways to make the generation of the root metadata optional with some smarter catalogs. This would get us to truly single-file commits in V4 without sacrificing portability, if catalogs are required to generate the root metadata file on demand.
>>>
>>> What should we do with built-in catalogs? Streaming appends should be supported out of the box without requiring aggressive table maintenance. It seems that not writing the root metadata for HMS is going to be a lot of work. Any thoughts? Should we then pursue offloading the snapshot history for such catalogs?
>>>
>>> On Tue, Feb 10, 2026 at 6:53 PM Anoop Johnson <[email protected]> wrote:
>>>> Agree that snapshot history is the main bloat factor. We've seen fast-moving tables where writing the metadata.json file takes several seconds. For comparison, Delta Lake uses an efficient binary-search-based time travel that can scale to O(millions) of table versions.
>>>>
>>>> Rather than limiting snapshot retention, we might want to consider adding time travel directly to the IRC spec. The catalog could implement scalable time travel using appropriate indexing. The GetTable API could accept an optional `AS OF` timestamp param and return the table metadata as of that timestamp. This would enable catalog implementations to choose their own time travel strategy (indexes, bloom filters, etc.). Catalogs that don't support time travel could return an error, and clients would fall back to the current behavior.
>>>>
>>>> I also like Prashant's solution to the portability concern: have catalogs materialize the metadata.json on demand through an export API when needed for migration scenarios.
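To make the `AS OF` idea above concrete, here is a rough sketch of what a point-in-time load could look like on the catalog side. This is purely illustrative: `PointInTimeCatalog` and `TableVersion` are invented names, nothing like this exists in the REST spec or the Java API today, and the indexing strategy is left entirely to the implementation.

    import java.time.Instant;
    import java.util.Optional;

    // Hypothetical sketch only -- not part of the Iceberg REST spec or Java API.
    interface PointInTimeCatalog {

      // Today's behavior: resolve the current version of the table.
      TableVersion loadTable(String namespace, String table);

      // Point-in-time variant: the catalog resolves the snapshot that was
      // current at `asOf` using whatever index it maintains (a sorted commit
      // log it can binary-search, a secondary index, etc.). Catalogs that do
      // not support time travel return Optional.empty(), and clients fall
      // back to client-side time travel over the snapshot history.
      default Optional<TableVersion> loadTable(String namespace, String table, Instant asOf) {
        return Optional.empty();
      }

      // Minimal stand-in for whatever metadata payload the catalog returns.
      record TableVersion(long snapshotId, long timestampMillis, String metadataLocation) {}
    }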
>>>> On Tue, Feb 10, 2026 at 6:27 PM Gang Wu <[email protected]> wrote:
>>>>> It seems that we are discussing two orthogonal approaches:
>>>>>
>>>>> 1. Making the writing of the complete metadata.json file optional during a commit, especially for catalogs that can manage metadata themselves.
>>>>> 2. Restructuring the metadata.json file (e.g., by offloading growing parts like snapshot history to external files) to limit its size and reduce write I/O, while still requiring the root file on every commit for portability.
>>>>>
>>>>> I believe both approaches are worth exploring because in some cases portability is still a top priority.
>>>>>
>>>>> Best,
>>>>> Gang
>>>>>
>>>>> On Wed, Feb 11, 2026 at 9:27 AM Manu Zhang <[email protected]> wrote:
>>>>>> Can we add an abstraction to the spec, like a root metadata (or snapshot history) manager, with the default implementation being metadata.json?
>>>>>>
>>>>>> On Wed, Feb 11, 2026 at 9:07 AM Prashant Singh <[email protected]> wrote:
>>>>>>> +1. I think snapshot summary bloat was a major factor in metadata.json bloat too, especially for streaming writers, based on my past experience. Since we didn't want to propose a spec change, one workaround was to enforce a strict limit on how many snapshots to keep and let remove-orphans do the cleanup. We also removed the snapshot summaries, since they are optional anyway, and because in streaming mode we create a large number of snapshots (not all of which were needed anyway).
>>>>>>>
>>>>>>> There have been a lot of interesting discussions about optimizing reads [1] as well as writes [2]. If we are open to relaxing the spec a bit, it would be nice to move the tracking of the metadata to the catalog, plus a protocol to retrieve it back without compromising portability. Maybe we can have a dedicated API that exports it to a file: in the intermediate state we operate only on what is stored in the catalog, and we materialize the file when and if asked. We are having a similar discussion in IRC.
>>>>>>>
>>>>>>> I think we all acknowledge this is a real problem for streaming writers :)!
>>>>>>>
>>>>>>> Past discussions:
>>>>>>> [1] https://lists.apache.org/thread/pwdd7qmdsfcrzjtsll53d3m9f74d03l8
>>>>>>> [2] https://github.com/apache/iceberg/issues/2723
>>>>>>>
>>>>>>> Best,
>>>>>>> Prashant Singh
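For reference, the stop-gap Prashant describes is possible with the existing core API today. A minimal sketch, with retention values chosen only for illustration (this trades away time travel beyond the retained window):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class SnapshotRetention {
      // Trim snapshot history so metadata.json stays small. Expired snapshots
      // disappear from the snapshot list in the root file, and files that are
      // no longer referenced are cleaned up (or left to orphan-file removal).
      public static void trim(Table table) {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);
        table.expireSnapshots()
            .expireOlderThan(cutoff) // drop snapshots older than one day...
            .retainLast(50)          // ...but always keep the 50 most recent
            .commit();
      }
    }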
>>>>>>> On Tue, Feb 10, 2026 at 4:45 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>> I think Yufei is right and the snapshot history is the main contributor. Streaming jobs that write every minute would generate over 10K snapshot entries per week. We had a similar problem with the list of manifests that kept growing (until we added manifest lists) and with references to previous metadata files (we only keep the last 100 now). So we can definitely come up with something for snapshot entries. We will have to ensure the entire set of snapshots is reachable from the latest root file, even if it requires multiple IO operations.
>>>>>>>>
>>>>>>>> The main question is whether we still want to require writing root JSON files during commits. If so, our commits will never be single-file commits. In V4, we will have to write the root manifest as well as the root metadata file. I would prefer the second to be optional, but we will need to think about static tables and how to incorporate that into the spec.
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 3:58 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>> AFAIK, the snapshot history is the main, if not the only, reason for the large metadata.json file. Moving the extra snapshot history to an additional file and keeping it referenced in the root one may just resolve the issue.
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>> On Tue, Feb 10, 2026 at 3:27 PM huaxin gao <[email protected]> wrote:
>>>>>>>>>> +1, I think this is a real problem, especially for streaming / frequent appends where commit latency matters and metadata.json keeps getting bigger.
>>>>>>>>>>
>>>>>>>>>> I also agree we probably shouldn't remove the root metadata file completely. Having one file that describes the whole table is really useful for portability and debugging.
>>>>>>>>>>
>>>>>>>>>> Of the options you listed, I like "offload pieces to external files" as a first step. We still write the root file every commit, but it won't grow as fast. The downside is extra maintenance/GC complexity.
>>>>>>>>>>
>>>>>>>>>> A couple of questions/ideas:
>>>>>>>>>>
>>>>>>>>>> - Do we have any data on which parts of metadata.json grow the most (snapshots / history / refs)? Even a rough breakdown could help decide what to move out first.
>>>>>>>>>> - Could we do a hybrid: still write the root file every commit, but only keep a "recent window" in it, and move older history to referenced files? (Portable, but with bounded growth.)
>>>>>>>>>> - For "optional on commit", maybe make it a catalog capability (fast commits if the catalog can serve metadata), but still support an export/materialize step when portability is needed.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Huaxin
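The "recent window" hybrid above is worth sketching, since it keeps the root file portable while bounding its growth. The shape below is hypothetical (the field and file names are invented for illustration; this is not the V4 spec):

    import java.util.List;

    // Hypothetical "recent window" root-file shape -- not the V4 spec.
    // The root file carries only the newest N history entries inline, plus a
    // pointer to an external file holding everything older, so its size is
    // O(window) instead of O(table age). Older history stays reachable from
    // the latest root file, at the cost of one extra read for deep time travel.
    record SnapshotLogEntry(long snapshotId, long timestampMillis) {}

    record RootMetadata(
        String tableUuid,
        long currentSnapshotId,
        List<SnapshotLogEntry> recentSnapshotLog, // bounded, e.g. the last 100 entries
        String olderSnapshotLogFile) {}           // null until something is spilled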
>>>>>>>>>> On Tue, Feb 10, 2026 at 2:58 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>> I don't think we have any consensus or a concrete plan. In fact, I don't know what my personal preference is at this point. The intention of this thread is to gain that clarity. I don't think removing the root metadata file entirely is a good idea. It is great to have a way to describe the entire state of a table in a file. We just need to find a solution for streaming appends that suffer from the increasing size of the root metadata file.
>>>>>>>>>>>
>>>>>>>>>>> Like I said, making the generation of the JSON file on commit optional is one way to solve this problem. We can also think about offloading pieces of it to external files (say, old snapshots). This would mean we still have to write the root file on each commit, but it would be smaller. One clear downside is more complicated maintenance.
>>>>>>>>>>>
>>>>>>>>>>> Any other ideas/thoughts/feedback? Do people see this as a problem?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 10, 2026 at 2:18 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>> Hi Anton, thanks for raising this. I would really like to make this optional and then build additional use cases on top of it. For example, a catalog like IRC could completely eliminate storage IO during commit and load, which is a big win. It could also provide better protection for encrypted Iceberg tables, since metadata.json files are plain text today.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, do we have consensus that metadata.json can be optional? There are real portability concerns, and engine-side work also needs consideration. For example, static tables and the Spark driver still expect to read this file directly from storage. It feels like the first step here is aligning on whether metadata.json can be optional at all, before we go deeper into how we get rid of it. What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 10, 2026 at 11:23 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>> While it may be common knowledge among Iceberg devs that writing the root JSON file on commit is somewhat optional with the right catalog, what can we do in V4 to solve this problem for everyone? My problem is the suboptimal behavior that new users get by default with HMS or Hadoop catalogs, and how this impacts their perception of Iceberg. We are doing a bunch of work for streaming (e.g., changelog scans, single-file commits, etc.), but the need to write the root JSON file may cancel all of that out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me throw some ideas out there:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Describe in the spec how catalogs can make the generation of the root metadata file optional. Ideally, implement that in a built-in catalog of choice as a reference implementation.
>>>>>>>>>>>>> - Offload portions of the root metadata file to external files and keep references to them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Anton
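One note on the "we only keep the last 100 now" point above: the previous-metadata-file log inside metadata.json is already bounded by existing table properties, which is exactly the kind of bound this thread is looking for on the snapshot history itself. A minimal example using properties that exist today:

    import org.apache.iceberg.Table;

    public class MetadataLogBounds {
      // Bounds the *previous metadata log* section of metadata.json; it does
      // not bound the snapshot history, which remains the unbounded section
      // discussed in this thread.
      public static void configure(Table table) {
        table.updateProperties()
            .set("write.metadata.previous-versions-max", "100")        // default: 100
            .set("write.metadata.delete-after-commit.enabled", "true") // default: false
            .commit();
      }
    }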
