Happy new year, Micah. Did you link the wrong doc (Iceberg Single File Commits)? I think you are referring to https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz

Best,
Manu

On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <[email protected]> wrote:

> Happy new year everyone, I just wanted to bump this thread (most
> discussion has been happening on the doc [1]) in case it was missed over
> the holidays.
>
> Thanks,
> Micah
>
> [1]
> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>
> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Sounds good, will wait until next year.
>>
>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>>
>>> Micah, many people will be OOO in the next two weeks. Can we extend the
>>> feedback deadline to at least 1-2 weeks after the new year?
>>>
>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> > I have no problem with adding this discussion to the single file
>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> independent addition to the metadata layout?
>>>>
>>>> Yes, it is fairly independent. The main reason I wanted to consolidate
>>>> in the doc, it appears there is a bit of metadata re-arrangement and new
>>>> fields. I wanted to make sure that:
>>>>
>>>> 1. We avoid field ID conflicts.
>>>> 2. When writing up the final spec changes it is easy to manage and not
>>>> create a dependency one way or another between the two of these.
>>>>
>>>> Happy to keep the implementation of the guard-rails as a separate piece
>>>> of work.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> I have no problem with adding this discussion to the single file work,
>>>>> but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> independent addition to the metadata layout?
>>>>>
>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>> Engines
>>>>>>> should always be able to safely drop or ignore tags without breaking
>>>>>>> reads
>>>>>>> or writes, with the only possible impact being suboptimal behavior
>>>>>>> (e.g.,
>>>>>>> extra I/O), as you described.
>>>>>>
>>>>>>
>>>>>> 100% I will also add this summary to the bottom of the requirements
>>>>>> section.
>>>>>>
>>>>>> Based on mailing list discussion and doc comments (or lack thereof),
>>>>>> it does not seem like there are strong objections to adding this for V4.
>>>>>> Prashant seemed to maybe have concerns, so I'd like to understand if they
>>>>>> are blockers.
>>>>>>
>>>>>> If there isn't additional feedback by the end of next week, I'd like
>>>>>> to assume a lazy consensus and consolidate this with the single file
>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>>> Please let me know if there is a different process.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>>
>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>> Engines
>>>>>>> should always be able to safely drop or ignore tags without breaking
>>>>>>> reads
>>>>>>> or writes, with the only possible impact being suboptimal behavior
>>>>>>> (e.g.,
>>>>>>> extra I/O), as you described.
>>>>>>>
>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>>>> trade-off feels reasonable to me.
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Yufei,
>>>>>>>>
>>>>>>>>> If one engine started to rely on a tag for certain reasons(like
>>>>>>>>> clustering algorithm), would data file rewrite(compaction) by another
>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>>>>>>
>>>>>>>>
>>>>>>>> The intent here is that dropping tags should never break an
>>>>>>>> engine. But it could cause suboptimal operations. For instance, one
>>>>>>>> example I brought in the docs is using tags to cache parquet footer
>>>>>>>> size,
>>>>>>>> to make sure it is fetched in 1 I/O.
>>>>>>>>
>>>>>>>> In this case the following would occur.
>>>>>>>>
>>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size in
>>>>>>>> tags.
>>>>>>>> 2. Engine 2 does a rewrite/compactions and produces File 2 without
>>>>>>>> tags.
>>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer length
>>>>>>>> is missing so it falls back reading a reasonable number of bytes from
>>>>>>>> the
>>>>>>>> end of the parquet file, hoping the entire footer is retrieved (and if
>>>>>>>> it
>>>>>>>> isn't a second I/O is necessary).
>>>>>>>>
>>>>>>>> Similarly for clustering algorithms, I think the result could yield
>>>>>>>> a sub-optimally clustered table, or perhaps redundant clustering
>>>>>>>> operations
>>>>>>>> but shouldn't break anything. This is no worse then the case today
>>>>>>>> though
>>>>>>>> if engine 1 and engine 2 have different clustering algorithms and they
>>>>>>>> are
>>>>>>>> being run in interleaved fashion on the same table. In this case it is
>>>>>>>> highly likely that some amount of duplicate compaction is happening.
>>>>>>>>
>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>>>>> functioning should never be put in tags.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the proposal!
>>>>>>>>>
>>>>>>>>> If one engine started to rely on a tag for certain reasons(like
>>>>>>>>> clustering algorithm), would data file rewrite(compaction) by another
>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Iceberg Dev,
>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files in
>>>>>>>>>> V4 metadata [2]. More details are in the document but the intent is
>>>>>>>>>> to
>>>>>>>>>> allow engines to store optional metadata associated with these files:
>>>>>>>>>>
>>>>>>>>>> 1. The proposed field is optional and cannot be used for
>>>>>>>>>> metadata required for reading the table correctly.
>>>>>>>>>> 2. It also proposes guard-rails for not letting tags cause
>>>>>>>>>> metadata bloat.
>>>>>>>>>>
>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>>>>> [2]
>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>>>>>
>>>>>>>>>>
