I have the same concerns as Fokko w.r.t. offloading the snapshot history into a separate file. I think it makes sense to focus on making the metadata file optional, but to have a way of generating it on demand when needed.
On Thu, Feb 12, 2026 at 2:28 PM Fokko Driesprong <[email protected]> wrote:
> Hey all,
>
> I agree that making the metadata optional would solve a lot of the issues that we see today. I do think it makes sense to be able to request the metadata JSON if needed.
>
> For me, restructuring the snapshot history sounds more problematic. The {Hive/Glue/Sql}Catalog would have to write two files, of which the unbounded one could take some time to write. Wouldn't we be shifting the problem while introducing a lot of complexity at the same time?
>
> Kind regards,
> Fokko
>
> On Wed, Feb 11, 2026 at 10:53 PM Yufei Gu <[email protected]> wrote:
>> I think Gang has a good point: we are discussing two things here: 1) making the metadata.json file optional, and 2) restructuring the metadata.json. The first one is more controversial as it impacts portability. I'd suggest starting with the second one: we could try to move unbounded metadata (e.g., snapshot history) out of the metadata.json file.
>>
>> Yufei
>>
>> On Wed, Feb 11, 2026 at 8:38 AM Anton Okolnychyi <[email protected]> wrote:
>>> It looks like there is enough interest in the community and a few good ways to make the generation of the root metadata optional with some smarter catalogs. This would get us to truly single-file commits in V4 without sacrificing portability, if catalogs are required to generate the root metadata file on demand.
>>>
>>> What should we do with built-in catalogs? Streaming appends should be supported out of the box without requiring aggressive table maintenance. It seems that not writing the root metadata for HMS is going to be a lot of work. Any thoughts? Should we then pursue offloading the snapshot history for such catalogs?
>>>
>>> On Tue, Feb 10, 2026 at 6:53 PM Anoop Johnson <[email protected]> wrote:
>>>> Agree that snapshot history is the main bloat factor. We've seen fast-moving tables where writing the metadata.json file takes several seconds. For comparison, Delta Lake uses an efficient binary-search-based time travel that can scale to O(millions) of table versions.
>>>>
>>>> Rather than limiting snapshot retention, we might want to consider adding time travel directly to the IRC spec. The catalog could implement scalable time travel using appropriate indexing. The GetTable API could accept an optional `AS OF` timestamp param and return the table metadata as of that timestamp. This would enable catalog implementations to choose their own time travel strategy (indexes, bloom filters, etc.). Catalogs that don't support time travel could return an error, and clients would fall back to the current behavior.
>>>>
>>>> I also like Prashant's solution to the portability concern: have catalogs materialize the metadata.json on demand through an export API when needed for migration scenarios.
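To make the `AS OF` idea above concrete, here is a rough sketch of what a point-in-time load could look like on the catalog side. This is purely illustrative: `PointInTimeCatalog` and `TableVersion` are invented names, nothing like this exists in the REST spec or the Java API today, and the indexing strategy is left entirely to the implementation.

    import java.time.Instant;
    import java.util.Optional;

    // Hypothetical sketch only -- not part of the Iceberg REST spec or Java API.
    interface PointInTimeCatalog {

      // Today's behavior: resolve the current version of the table.
      TableVersion loadTable(String namespace, String table);

      // Point-in-time variant: the catalog resolves the snapshot that was
      // current at `asOf` using whatever index it maintains (a sorted commit
      // log it can binary-search, a secondary index, etc.). Catalogs that do
      // not support time travel return Optional.empty(), and clients fall
      // back to client-side time travel over the snapshot history.
      default Optional<TableVersion> loadTable(String namespace, String table, Instant asOf) {
        return Optional.empty();
      }

      // Minimal stand-in for whatever metadata payload the catalog returns.
      record TableVersion(long snapshotId, long timestampMillis, String metadataLocation) {}
    }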
>>>> On Tue, Feb 10, 2026 at 6:27 PM Gang Wu <[email protected]> wrote:
>>>>> It seems that we are discussing two orthogonal approaches:
>>>>>
>>>>> 1. Making the writing of the complete metadata.json file optional during a commit, especially for catalogs that can manage metadata themselves.
>>>>> 2. Restructuring the metadata.json file (e.g., by offloading growing parts like snapshot history to external files) to limit its size and reduce write I/O, while still requiring the root file on every commit for portability.
>>>>>
>>>>> I believe both approaches are worth exploring because in some cases portability is still a top priority.
>>>>>
>>>>> Best,
>>>>> Gang
>>>>>
>>>>> On Wed, Feb 11, 2026 at 9:27 AM Manu Zhang <[email protected]> wrote:
>>>>>> Can we add an abstraction to the spec, like a root metadata (or snapshot history) manager, with the default implementation being metadata.json?
>>>>>>
>>>>>> On Wed, Feb 11, 2026 at 9:07 AM Prashant Singh <[email protected]> wrote:
>>>>>>> +1. I think snapshot summary bloat was a major factor in metadata.json bloat too, especially for streaming writers, based on my past experience. Since we didn't want to propose a spec change, one workaround was to enforce a strict limit on how many snapshots to keep and let remove-orphans do the cleanup. We also removed the snapshot summaries, since they are optional anyway, and because in streaming mode we create a large number of snapshots (not all of which were needed anyway).
>>>>>>>
>>>>>>> There have been a lot of interesting discussions about optimizing reads [1] as well as writes [2]. If we are open to relaxing the spec a bit, it would be nice to move the tracking of the metadata to the catalog, plus a protocol to retrieve it back without compromising portability. Maybe we can have a dedicated API that exports it to a file: in the intermediate state we operate only on what is stored in the catalog, and we materialize the file when and if asked. We are having a similar discussion in IRC.
>>>>>>>
>>>>>>> I think we all acknowledge this is a real problem for streaming writers :)!
>>>>>>>
>>>>>>> Past discussions:
>>>>>>> [1] https://lists.apache.org/thread/pwdd7qmdsfcrzjtsll53d3m9f74d03l8
>>>>>>> [2] https://github.com/apache/iceberg/issues/2723
>>>>>>>
>>>>>>> Best,
>>>>>>> Prashant Singh
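For reference, the stop-gap Prashant describes is possible with the existing core API today. A minimal sketch, with retention values chosen only for illustration (this trades away time travel beyond the retained window):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class SnapshotRetention {
      // Trim snapshot history so metadata.json stays small. Expired snapshots
      // disappear from the snapshot list in the root file, and files that are
      // no longer referenced are cleaned up (or left to orphan-file removal).
      public static void trim(Table table) {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);
        table.expireSnapshots()
            .expireOlderThan(cutoff) // drop snapshots older than one day...
            .retainLast(50)          // ...but always keep the 50 most recent
            .commit();
      }
    }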
>>>>>>> On Tue, Feb 10, 2026 at 4:45 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>> I think Yufei is right and the snapshot history is the main contributor. Streaming jobs that write every minute would generate over 10K snapshot entries per week. We had a similar problem with the list of manifests that kept growing (until we added manifest lists) and with references to previous metadata files (we only keep the last 100 now). So we can definitely come up with something for snapshot entries. We will have to ensure the entire set of snapshots is reachable from the latest root file, even if it requires multiple IO operations.
>>>>>>>>
>>>>>>>> The main question is whether we still want to require writing root JSON files during commits. If so, our commits will never be single-file commits. In V4, we will have to write the root manifest as well as the root metadata file. I would prefer the second to be optional, but we will need to think about static tables and how to incorporate that into the spec.
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 3:58 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>> AFAIK, the snapshot history is the main, if not the only, reason for the large metadata.json file. Moving the extra snapshot history to an additional file and keeping it referenced in the root one may just resolve the issue.
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>> On Tue, Feb 10, 2026 at 3:27 PM huaxin gao <[email protected]> wrote:
>>>>>>>>>> +1, I think this is a real problem, especially for streaming / frequent appends where commit latency matters and metadata.json keeps getting bigger.
>>>>>>>>>>
>>>>>>>>>> I also agree we probably shouldn't remove the root metadata file completely. Having one file that describes the whole table is really useful for portability and debugging.
>>>>>>>>>>
>>>>>>>>>> Of the options you listed, I like "offload pieces to external files" as a first step. We still write the root file every commit, but it won't grow as fast. The downside is extra maintenance/GC complexity.
>>>>>>>>>>
>>>>>>>>>> A couple of questions/ideas:
>>>>>>>>>>
>>>>>>>>>> - Do we have any data on which parts of metadata.json grow the most (snapshots / history / refs)? Even a rough breakdown could help decide what to move out first.
>>>>>>>>>> - Could we do a hybrid: still write the root file every commit, but only keep a "recent window" in it, and move older history to referenced files? (Portable, but with bounded growth.)
>>>>>>>>>> - For "optional on commit", maybe make it a catalog capability (fast commits if the catalog can serve metadata), but still support an export/materialize step when portability is needed.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Huaxin
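The "recent window" hybrid above is worth sketching, since it keeps the root file portable while bounding its growth. The shape below is hypothetical (the field and file names are invented for illustration; this is not the V4 spec):

    import java.util.List;

    // Hypothetical "recent window" root-file shape -- not the V4 spec.
    // The root file carries only the newest N history entries inline, plus a
    // pointer to an external file holding everything older, so its size is
    // O(window) instead of O(table age). Older history stays reachable from
    // the latest root file, at the cost of one extra read for deep time travel.
    record SnapshotLogEntry(long snapshotId, long timestampMillis) {}

    record RootMetadata(
        String tableUuid,
        long currentSnapshotId,
        List<SnapshotLogEntry> recentSnapshotLog, // bounded, e.g. the last 100 entries
        String olderSnapshotLogFile) {}           // null until something is spilled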
>>>>>>>>>> On Tue, Feb 10, 2026 at 2:58 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>> I don't think we have any consensus or a concrete plan. In fact, I don't know what my personal preference is at this point. The intention of this thread is to gain that clarity. I don't think removing the root metadata file entirely is a good idea. It is great to have a way to describe the entire state of a table in a file. We just need to find a solution for streaming appends that suffer from the increasing size of the root metadata file.
>>>>>>>>>>>
>>>>>>>>>>> Like I said, making the generation of the JSON file on commit optional is one way to solve this problem. We can also think about offloading pieces of it to external files (say, old snapshots). This would mean we still have to write the root file on each commit, but it would be smaller. One clear downside is more complicated maintenance.
>>>>>>>>>>>
>>>>>>>>>>> Any other ideas/thoughts/feedback? Do people see this as a problem?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 10, 2026 at 2:18 PM Yufei Gu <[email protected]> wrote:
>>>>>>>>>>>> Hi Anton, thanks for raising this. I would really like to make this optional and then build additional use cases on top of it. For example, a catalog like IRC could completely eliminate storage IO during commit and load, which is a big win. It could also provide better protection for encrypted Iceberg tables, since metadata.json files are plain text today.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, do we have consensus that metadata.json can be optional? There are real portability concerns, and engine-side work also needs consideration. For example, static tables and the Spark driver still expect to read this file directly from storage. It feels like the first step here is aligning on whether metadata.json can be optional at all, before we go deeper into how we get rid of it. What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 10, 2026 at 11:23 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>>>>>>>>> While it may be common knowledge among Iceberg devs that writing the root JSON file on commit is somewhat optional with the right catalog, what can we do in V4 to solve this problem for everyone? My problem is the suboptimal behavior that new users get by default with HMS or Hadoop catalogs, and how this impacts their perception of Iceberg. We are doing a bunch of work for streaming (e.g., changelog scans, single-file commits, etc.), but the need to write the root JSON file may cancel all of that out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me throw some ideas out there:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Describe in the spec how catalogs can make the generation of the root metadata file optional. Ideally, implement that in a built-in catalog of choice as a reference implementation.
>>>>>>>>>>>>> - Offload portions of the root metadata file to external files and keep references to them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Anton
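One note on the "we only keep the last 100 now" point above: the previous-metadata-file log inside metadata.json is already bounded by existing table properties, which is exactly the kind of bound this thread is looking for on the snapshot history itself. A minimal example using properties that exist today:

    import org.apache.iceberg.Table;

    public class MetadataLogBounds {
      // Bounds the *previous metadata log* section of metadata.json; it does
      // not bound the snapshot history, which remains the unbounded section
      // discussed in this thread.
      public static void configure(Table table) {
        table.updateProperties()
            .set("write.metadata.previous-versions-max", "100")        // default: 100
            .set("write.metadata.delete-after-commit.enabled", "true") // default: false
            .commit();
      }
    }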
