Happy new year everyone, I just wanted to bump this thread (most discussion has been happening on the doc [1]) in case it was missed over the holidays.

Thanks,
Micah

[1] https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw

On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]> wrote:

> Sounds good, will wait until next year.
>
> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>
>> Micah, many people will be OOO in the next two weeks. Can we extend the
>> feedback deadline to at least 1-2 weeks after the new year?
>>
>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> > I have no problem with adding this discussion to the single file work,
>>> > but I'm not sure that would speed it up? Seems like this is a pretty
>>> > independent addition to the metadata layout?
>>>
>>> Yes, it is fairly independent. The main reason I wanted to consolidate
>>> in the doc is that there appears to be a bit of metadata re-arrangement
>>> and new fields. I wanted to make sure that:
>>>
>>> 1. We avoid field ID conflicts.
>>> 2. When writing up the final spec changes, it is easy to manage them
>>>    without creating a dependency one way or another between the two.
>>>
>>> Happy to keep the implementation of the guard-rails as a separate piece
>>> of work.
>>>
>>> Cheers,
>>> Micah
>>>
>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <[email protected]> wrote:
>>>
>>>> I have no problem with adding this discussion to the single file work,
>>>> but I'm not sure that would speed it up? Seems like this is a pretty
>>>> independent addition to the metadata layout?
>>>>
>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>> optional and never required for correctness or basic functionality.
>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>> breaking reads or writes, with the only possible impact being
>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>
>>>>>
>>>>> 100% I will also add this summary to the bottom of the requirements
>>>>> section.
>>>>>
>>>>> Based on mailing list discussion and doc comments (or lack thereof),
>>>>> it does not seem like there are strong objections to adding this for V4.
>>>>> Prashant seemed to maybe have concerns, so I'd like to understand if they
>>>>> are blockers.
>>>>>
>>>>> If there isn't additional feedback by the end of next week, I'd like
>>>>> to assume a lazy consensus and consolidate this with the single file
>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>> Please let me know if there is a different process.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>
>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>> optional and never required for correctness or basic functionality.
>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>> breaking reads or writes, with the only possible impact being
>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>>
>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>>> trade-off feels reasonable to me.
>>>>>>
>>>>>> Yufei
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Yufei,
>>>>>>>
>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>>>>>
>>>>>>>
>>>>>>> The intent here is that dropping tags should never break an engine.
>>>>>>> But it could cause suboptimal operations. For instance, one example I
>>>>>>> brought up in the doc is using tags to cache the Parquet footer size,
>>>>>>> to make sure the footer is fetched in 1 I/O.
>>>>>>>
>>>>>>> In this case the following would occur.
>>>>>>>
>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size in
>>>>>>>    tags.
>>>>>>> 2. Engine 2 does a rewrite/compaction and produces file 2 without
>>>>>>>    tags.
>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer length
>>>>>>>    is missing, so it falls back to reading a reasonable number of
>>>>>>>    bytes from the end of the Parquet file, hoping the entire footer
>>>>>>>    is retrieved (and if it isn't, a second I/O is necessary).
>>>>>>>
>>>>>>> Similarly for clustering algorithms, I think the result could yield
>>>>>>> a sub-optimally clustered table, or perhaps redundant clustering
>>>>>>> operations, but shouldn't break anything. This is no worse than the
>>>>>>> case today, though, if engine 1 and engine 2 have different clustering
>>>>>>> algorithms and they are being run in interleaved fashion on the same
>>>>>>> table. In this case it is highly likely that some amount of duplicate
>>>>>>> compaction is happening.
>>>>>>>
>>>>>>> In the current proposal, any metadata that is required for proper
>>>>>>> functioning should never be put in tags.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the proposal!
>>>>>>>>
>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Iceberg Dev,
>>>>>>>>> I added a proposal [1] to add a key-value tags field for files in
>>>>>>>>> V4 metadata [2]. More details are in the document, but the intent is
>>>>>>>>> to allow engines to store optional metadata associated with these
>>>>>>>>> files:
>>>>>>>>>
>>>>>>>>> 1. The proposed field is optional and cannot be used for metadata
>>>>>>>>>    required for reading the table correctly.
>>>>>>>>> 2. The proposal also includes guard-rails to keep tags from causing
>>>>>>>>>    metadata bloat.
>>>>>>>>>
>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>>>> [2]
>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
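
For anyone skimming the thread, the footer-size example quoted above (tag present: one read; tag dropped by another engine: speculative read with a possible second I/O) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the proposal or from any Iceberg or Parquet library: the tag key `footer-size`, the function `read_footer`, the plain dict standing in for the tags field, and the 64 KiB default guess are all hypothetical. The only format fact relied on is that a Parquet file ends with the footer, a 4-byte little-endian footer length, and the magic bytes `PAR1`.

```python
import io
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8            # 4-byte little-endian footer length + 4-byte magic
DEFAULT_GUESS = 64 * 1024  # assumed size of the speculative tail read


def read_footer(f, file_size, tags=None, guess=DEFAULT_GUESS):
    """Return (footer_bytes, number_of_reads) for a seekable binary file.

    `tags` is a plain dict standing in for the proposed per-file tags.
    The "footer-size" key is hypothetical. A missing, malformed, or stale
    tag only costs extra I/O; it never changes the result.
    """
    hint = None
    if tags:
        try:
            hint = int(tags.get("footer-size", ""))
        except (TypeError, ValueError):
            hint = None  # malformed tag: ignore it and fall back
    if hint is not None and hint >= 0:
        want = hint + TRAILER_LEN
    else:
        want = guess
    want = min(max(want, TRAILER_LEN), file_size)

    # First (and ideally only) read: the tail of the file.
    f.seek(file_size - want)
    tail = f.read(want)
    reads = 1
    if tail[-4:] != MAGIC:
        raise ValueError("missing Parquet magic bytes at end of file")

    # The true footer length always comes from the file itself, not the tag.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("footer length exceeds file size")

    if needed > len(tail):
        # Tag missing or too small: a second I/O fetches the whole footer.
        f.seek(file_size - needed)
        tail = f.read(needed)
        reads += 1

    return tail[len(tail) - needed:len(tail) - TRAILER_LEN], reads


def make_fake_file(footer):
    """Build a stand-in byte string with a Parquet-style trailer (demo only)."""
    return MAGIC + b"\x00" * 100 + footer + struct.pack("<I", len(footer)) + MAGIC


if __name__ == "__main__":
    footer = b"F" * 200
    data = make_fake_file(footer)
    with_tag = read_footer(io.BytesIO(data), len(data), {"footer-size": "200"})
    without_tag = read_footer(io.BytesIO(data), len(data), None, guess=64)
    print(with_tag[1], without_tag[1])
```

The point of the sketch is the constraint discussed in the thread: the tag is used only to size the first read, and the footer length stored in the file remains the source of truth, so dropping or corrupting the tag degrades I/O count but not correctness.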
