Thanks, Kevin, for accepting. Thanks for your feedback, Prashant. Since you have been actively reviewing, I moved the event to Tuesday at a time you mentioned you would be available. Hopefully this doesn't exclude anybody else who wants to join the conversation.

Thanks,
Micah

On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]> wrote:

> Thanks for bumping this thread, Micah, and thank you for all the work! I missed this thread completely, apologies for that. I have so far been responding to the design docs (it would be nice to link the ML thread to the doc too).
>
> For the feedback, I am not supportive of this proposal, and I am looking forward to hearing from other community members on why, despite these severe cons, we should be doing it, especially given we have a clear, aligned path on how to introduce these in a backward compatible way.
>
> Here are my reservations:
> 1/ While the proposal says one can limit the default size to 512B, it says it is configurable. This can severely impact the number of entries we can have in a manifest file. We went through the whole exercise of whether we should have inline manifest DVs or not, and based on the tradeoffs we concluded one over the other. Giving this much size in the worst case per data file inside the manifest can severely impact the query planning time and query execution cost (more IO) of the Iceberg readers, which may be different from whoever produced the Iceberg data set.
> 2/ It works on the assumption that we need a spec version bump to add new fields, which I think is not completely true. We added things like partition stats / the statistics field as optional. I don't understand why we can't do the same, especially with things like schema_id and footer_size mentioned as motivation. I think the community was pretty aligned on having schema_id as an optional field to keep writer backward compatibility, with all new writers taking the benefit of this [1].
> 3/ One of the stated motivations is to support vendors' proprietary metadata for their proprietary clustering algorithms. To me this looks like a way to work around the spec and let the Iceberg metadata layout carry information that doesn't mean anything to the Iceberg ecosystem and can compromise interoperability.
> Also think of a case where Vendor A starts producing something in partnership with Vendor B and, to make things worse, encrypts it so that Vendor C, who is not in this partnership, cannot see it. IMHO we should not open up new ways that hurt interop.
>
> I also want to thank you for proposing the meeting. Unfortunately the proposed time doesn't work for me, as I have a conflicting meeting. Please feel free to proceed without me; I can watch the recording later as well. As far as my support is concerned, I look forward to answers that strongly support this use case and explain why we should be OK accepting these cons given we already had a well-thought-out path to move forward.
>
> [1] https://github.com/apache/iceberg/pull/4898
>
> Best,
> Prashant Singh
>
> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]> wrote:
>
>> I added/accepted on the dev calendar. Looking forward to it!
>>
>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <[email protected]> wrote:
>>
>>> It seems like we might not have full alignment on this proposal, so I tentatively scheduled a sync for next Monday (added to the iceberg dev events calendar). Please let me know if you are interested in joining and the time doesn't work for you (we can reschedule accordingly).
>>>
>>> Thanks,
>>> Micah
>>>
>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>> > As an update I've made the proposal to add this field to the Single file commits doc.
>>> >
>>> > Please let me know if there is any additional feedback.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <[email protected]> wrote:
>>> >
>>> > > Thanks Manu, that is the right doc.
>>> > >
>>> > > As an update, I've incorporated feedback from the community to the document:
>>> > >
>>> > > At a high level the changes are:
>>> > > - Renamed the field from "tags" to "attributes"
>>> > > - Clarified limits on attributes should only be enforced for new data. Existing tags must always be carried through.
>>> > > - Added more details on enforcing size of tags.
>>> > >
>>> > > Are there any objections to folding the proposal into the V4 metadata proposal? Again, the reasons for doing so are mostly around ensuring consistent field numbering and making the spec update easier.
>>> > >
>>> > > If people want further discussion on this I'd be happy to discuss at the next V4 metadata sync or create a one-off meeting. Please let me know.
>>> > >
>>> > > Thanks,
>>> > > Micah
>>> > >
>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]> wrote:
>>> > >
>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File Commits)?
>>> > >> I think you are referring to
>>> > >> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>> > >>
>>> > >> Best,
>>> > >> Manu
>>> > >>
>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <[email protected]> wrote:
>>> > >>
>>> > >>> Happy new year everyone, I just wanted to bump this thread (most discussion has been happening on the doc [1]) in case it was missed over the holidays.
>>> > >>>
>>> > >>> Thanks,
>>> > >>> Micah
>>> > >>>
>>> > >>> [1] https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>> > >>>
>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]> wrote:
>>> > >>>
>>> > >>>> Sounds good, will wait until next year.
>>> > >>>>
>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>>> > >>>>
>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend the feedback deadline to at least 1-2 weeks after the new year?
>>> > >>>>>
>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]> wrote:
>>> > >>>>>
>>> > >>>>>> > I have no problem with adding this discussion to the single file work, but I'm not sure that would speed it up? Seems like this is a pretty independent addition to the metadata layout?
>>> > >>>>>>
>>> > >>>>>> Yes, it is fairly independent. The main reason I wanted to consolidate in the doc is that there appears to be a bit of metadata re-arrangement and new fields. I wanted to make sure that:
>>> > >>>>>>
>>> > >>>>>> 1. We avoid field ID conflicts.
>>> > >>>>>> 2. When writing up the final spec changes it is easy to manage and does not create a dependency one way or another between the two of these.
>>> > >>>>>>
>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate piece of work.
>>> > >>>>>>
>>> > >>>>>> Cheers,
>>> > >>>>>> Micah
>>> > >>>>>>
>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <[email protected]> wrote:
>>> > >>>>>>
>>> > >>>>>>> I have no problem with adding this discussion to the single file work, but I'm not sure that would speed it up? Seems like this is a pretty independent addition to the metadata layout?
>>> > >>>>>>>
>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]> wrote:
>>> > >>>>>>>
>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out (and double-confirm) the key principle here: all tags must be strictly optional and never required for correctness or basic functionality. Engines should always be able to safely drop or ignore tags without breaking reads or writes, with the only possible impact being suboptimal behavior (e.g., extra I/O), as you described.
>>> > >>>>>>>>
>>> > >>>>>>>> 100%. I will also add this summary to the bottom of the requirements section.
>>> > >>>>>>>>
>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack thereof), it does not seem like there are strong objections to adding this for V4. Prashant seemed to maybe have concerns, so I'd like to understand if they are blockers.
>>> > >>>>>>>>
>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd like to assume a lazy consensus and consolidate this with the single file improvement work, which has already reorganized the metadata schema [1]. Please let me know if there is a different process.
>>> > >>>>>>>>
>>> > >>>>>>>> Thanks,
>>> > >>>>>>>> Micah
>>> > >>>>>>>>
>>> > >>>>>>>> [1] https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>> > >>>>>>>>
>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out (and double-confirm) the key principle here: all tags must be strictly optional and never required for correctness or basic functionality. Engines should always be able to safely drop or ignore tags without breaking reads or writes, with the only possible impact being suboptimal behavior (e.g., extra I/O), as you described.
>>> > >>>>>>>>>
>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the trade-off feels reasonable to me.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Yufei
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <[email protected]> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> Hi Yufei,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a clustering algorithm), would a data file rewrite (compaction) by another engine remove the tag and break the engine relying on it?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> The intent here is that dropping tags should never break an engine. But it could cause suboptimal operations. For instance, one example I brought up in the docs is using tags to cache the parquet footer size, to make sure the footer is fetched in 1 I/O.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> In this case the following would occur.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size in tags.
>>> > >>>>>>>>>> 2. Engine 2 does a rewrite/compaction and produces file 2 without tags.
>>> > >>>>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer length is missing, so it falls back to reading a reasonable number of bytes from the end of the parquet file, hoping the entire footer is retrieved (and if it isn't, a second I/O is necessary).
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could be a sub-optimally clustered table, or perhaps redundant clustering operations, but it shouldn't break anything. This is no worse than the case today, though, if engine 1 and engine 2 have different clustering algorithms and they are being run in interleaved fashion on the same table. In this case it is highly likely that some amount of duplicate compaction is happening.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper functioning should never be put in tags.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Thanks,
>>> > >>>>>>>>>> Micah
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> Thanks for the proposal!
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a clustering algorithm), would a data file rewrite (compaction) by another engine remove the tag and break the engine relying on it?
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Yufei
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files in V4 metadata [2]. More details are in the document, but the intent is to allow engines to store optional metadata associated with these files:
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> 1. The proposed field is optional and cannot be used for metadata required for reading the table correctly.
>>> > >>>>>>>>>>>> 2. It also proposes guard-rails for not letting tags cause metadata bloat.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Thanks,
>>> > >>>>>>>>>>>> Micah
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>> > >>>>>>>>>>>> [2] https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
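
The footer-size scenario discussed in the thread (an engine records the Parquet footer size as a file attribute, another engine drops it, and the reader falls back to a guessed tail read) might look roughly like the sketch below. This is only an illustration and is not code from Iceberg or from the proposal: the attribute key `footer_size`, the string-valued attribute map, the `read_at(offset, length)` callback, and the 64 KiB default guess are all assumptions made for the example. The only format detail relied on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8            # 4-byte little-endian footer length + 4-byte magic
DEFAULT_GUESS = 64 * 1024  # hypothetical fallback tail size, not from the proposal


def read_footer(read_at, file_size, attributes=None, guess=DEFAULT_GUESS):
    """Return (footer_bytes, number_of_reads) for a Parquet-like file.

    read_at(offset, length) is a hypothetical ranged-read callback.
    attributes is the optional key-value map; it is only ever a hint.
    """
    if file_size < TRAILER_LEN:
        raise ValueError("file too small to contain a Parquet trailer")

    # "footer_size" is a hypothetical attribute key used for illustration.
    hint = (attributes or {}).get("footer_size")
    try:
        tail_len = int(hint) + TRAILER_LEN if hint is not None else guess
    except (TypeError, ValueError):
        tail_len = guess  # an unparsable hint is ignored, never fatal

    tail_len = max(TRAILER_LEN, min(tail_len, file_size))
    tail = read_at(file_size - tail_len, tail_len)
    reads = 1

    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing trailing magic")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])

    needed = footer_len + TRAILER_LEN
    if needed > len(tail):
        # Hint missing, stale, or guess too small: pay for a second read.
        footer = read_at(file_size - needed, footer_len)
        reads += 1
    else:
        footer = tail[len(tail) - needed:len(tail) - TRAILER_LEN]
    return footer, reads
```

With a correct hint the footer comes back in one read. With the hint missing, stale, or unparsable, the function still returns the correct footer and only the number of reads changes, which is the "optional, never required for correctness" behavior the thread asks for.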
