Hi Prashant, unfortunately I have conflicts on Wednesdays at that time for the foreseeable future. Hopefully between the sync and the mailing list we can figure out a path forward. If anybody else has feedback, please add it to the Google doc or reply to the thread and I can address it.
Thanks,
Micah

On Thursday, March 26, 2026, Prashant Singh wrote:

> Thank you for being flexible Micah, how about we add this to the agenda
> item in iceberg community sync which is just a day after at 9 pm, a lot of
> folks join and we will have better participation.
> And it seems like we would have time to talk since I see the agenda is
> still open, if we can't conclude we can have a dedicated sync for it.
>
> Best,
> Prashant Singh
>
> On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield wrote:
>
>> Thanks Kevin for accepting. Thanks for your feedback Prashant, since you
>> have been active reviewing, I moved the event to Tuesday at a time that you
>> mentioned you would be available, hopefully this doesn't exclude anybody
>> else who wants to join the conversation.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh wrote:
>>
>>> Thanks for bumping this thread Micah and thank you for all the work! I
>>> missed this thread completely, apologies for that, I have so far been
>>> responding to the design docs (would be nice to link ML to doc too).
>>>
>>> For the feedback, I am not supportive of this proposal and I am looking
>>> forward to hearing from other community members on why, despite these
>>> severe cons, we should be doing it, especially given we have a clear,
>>> aligned path on how to introduce these in a backward compatible way.
>>>
>>> Here are my reservations:
>>> 1/ While the proposal says one can limit the default size to 512B, it
>>> also says it is configurable. This can severely impact the number of
>>> entries we can have in a manifest file. We went through the whole
>>> exercise of whether we should have inline manifest DV or not, and based
>>> on the tradeoff we concluded one over the other. Giving this much size
>>> in the worst case per data file inside the manifest can severely impact
>>> the query planning time and query execution cost (more IO) of the
>>> iceberg readers, which may be different from whoever produced the
>>> iceberg data set.
>>> 2/ It works on an assumption that we need to do a spec version bump to
>>> add new fields, which I think is not completely true: we added things
>>> like the partition stats / statistic field as optional. I don't
>>> understand why we can't do the same, especially with things like
>>> schema_id and footer_size mentioned as motivation. I think the community
>>> was pretty aligned to have schema_id as an optional field to have writer
>>> backward compatibility, as all new writers take the benefit of this [1]
>>> 3/ One of the motivations that is stated is to support vendors'
>>> proprietary metadata for supporting their proprietary clustering
>>> algorithms. This to me looks like a way to work around the spec to let
>>> the iceberg metadata layout carry info which doesn't mean anything to
>>> the iceberg ecosystem and can compromise interoperability.
>>> Also think of a case where Vendor A starts producing something
>>> partnering with Vendor B and, to make things worse, encrypts it and does
>>> not let Vendor C, which is not in this partnership, see it. IMHO we
>>> should not open up new ways that hurt the interop.
>>>
>>> I also want to thank you for proposing the meeting. Unfortunately the
>>> proposed time doesn't work for me, I have a conflicting meeting. Please
>>> feel free to proceed without me, I can watch the recording later as well.
>>> As far as my support is concerned, I look forward to answers that
>>> strongly support this use case and explain why we should be OK accepting
>>> these cons given we already had a well thought path to move forward.
>>>
>>> [1] https://github.com/apache/iceberg/pull/4898
>>>
>>> Best,
>>> Prashant Singh
>>>
>>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu wrote:
>>>
>>>> I added/accepted on the dev calendar. Looking forward to it!
>>>>
>>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield wrote:
>>>>
>>>>> It seems like we might not have full alignment on this proposal, I
>>>>> tentatively scheduled a sync for next Monday (added to the iceberg dev
>>>>> events calendar). Please let me know if you are interested in joining and
>>>>> the time doesn't work for you (we can reschedule accordingly).
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>>> > As an update I've made the proposal to add this field to the Single file
>>>>> > commits doc.
>>>>> >
>>>>> > Please let me know if there is any additional feedback.
>>>>> >
>>>>> > Thanks,
>>>>> > Micah
>>>>> >
>>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield wrote:
>>>>> >
>>>>> > > Thanks Manu, that is the right doc.
>>>>> > >
>>>>> > > As an update, I've incorporated feedback from the community to the
>>>>> > > document:
>>>>> > >
>>>>> > > At a high level the changes are:
>>>>> > > - Renamed the field from "tags" to "attributes"
>>>>> > > - Clarified limits on attributes should only be enforced for new data.
>>>>> > > Existing tags must always be carried through.
>>>>> > > - Added more details on enforcing size of tags.
>>>>> > >
>>>>> > > Are there any objections to folding the proposal into the V4 metadata
>>>>> > > proposal? Again, the reasons for doing so are mostly around ensuring
>>>>> > > consistent field numbering and making the spec update easier.
>>>>> > >
>>>>> > > If people want further discussion on this I'd be happy to discuss at the
>>>>> > > next V4 metadata sync or create a one-off meeting. Please let me know.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Micah
>>>>> > >
>>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang wrote:
>>>>> > >
>>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
>>>>> > >> Commits) ?
>>>>> > >> I think you are referring to
>>>>> > >> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>> > >>
>>>>> > >> Best,
>>>>> > >> Manu
>>>>> > >>
>>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield wrote:
>>>>> > >>
>>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>>> > >>> discussion has been happening on the doc [1]) in case it was missed over
>>>>> > >>> the holidays.
>>>>> > >>>
>>>>> > >>> Thanks,
>>>>> > >>> Micah
>>>>> > >>>
>>>>> > >>> [1]
>>>>> > >>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>> > >>>
>>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield wrote:
>>>>> > >>>
>>>>> > >>>> Sounds good, will wait until next year.
>>>>> > >>>>
>>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu wrote:
>>>>> > >>>>
>>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend
>>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new year?
>>>>> > >>>>>
>>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield wrote:
>>>>> > >>>>>
>>>>> > >>>>>> > I have no problem with adding this discussion to the single file
>>>>> > >>>>>> > work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> > >>>>>> > independent addition to the metadata layout?
>>>>> > >>>>>>
>>>>> > >>>>>> Yes, it is fairly independent. The main reason I wanted to
>>>>> > >>>>>> consolidate in the doc, it appears there is a bit of metadata
>>>>> > >>>>>> re-arrangement and new fields. I wanted to make sure that:
>>>>> > >>>>>>
>>>>> > >>>>>> 1. We avoid field ID conflicts.
>>>>> > >>>>>> 2. When writing up the final spec changes it is easy to manage and
>>>>> > >>>>>> not create a dependency one way or another between the two of these.
>>>>> > >>>>>>
>>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate
>>>>> > >>>>>> piece of work.
>>>>> > >>>>>>
>>>>> > >>>>>> Cheers,
>>>>> > >>>>>> Micah
>>>>> > >>>>>>
>>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> I have no problem with adding this discussion to the single file
>>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> > >>>>>>> independent addition to the metadata layout?
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> 100% I will also add this summary to the bottom of the requirements
>>>>> > >>>>>>>> section.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>>> > >>>>>>>> thereof), it does not seem like there are strong objections to adding this
>>>>> > >>>>>>>> for V4. Prashant seemed to maybe have concerns, so I'd like to understand
>>>>> > >>>>>>>> if they are blockers.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the single file
>>>>> > >>>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>> > >>>>>>>> Please let me know if there is a different process.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thanks,
>>>>> > >>>>>>>> Micah
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> [1]
>>>>> > >>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu wrote:
>>>>> > >>>>>>>>
>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Yufei
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield wrote:
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>>> Hi Yufei,
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>>> > >>>>>>>>>> engine. But it could cause suboptimal operations. For instance, one
>>>>> > >>>>>>>>>> example I brought in the docs is using tags to cache parquet footer size,
>>>>> > >>>>>>>>>> to make sure it is fetched in 1 I/O.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> In this case the following would occur.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size
>>>>> > >>>>>>>>>> in tags.
>>>>> > >>>>>>>>>> 2. Engine 2 does a rewrite/compaction and produces File 2
>>>>> > >>>>>>>>>> without tags.
>>>>> > >>>>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer
>>>>> > >>>>>>>>>> length is missing so it falls back to reading a reasonable number of bytes
>>>>> > >>>>>>>>>> from the end of the parquet file, hoping the entire footer is retrieved
>>>>> > >>>>>>>>>> (and if it isn't a second I/O is necessary).
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>>> > >>>>>>>>>> yield a sub-optimally clustered table, or perhaps redundant clustering
>>>>> > >>>>>>>>>> operations but shouldn't break anything. This is no worse than the case
>>>>> > >>>>>>>>>> today though if engine 1 and engine 2 have different clustering algorithms
>>>>> > >>>>>>>>>> and they are being run in interleaved fashion on the same table. In this
>>>>> > >>>>>>>>>> case it is highly likely that some amount of duplicate compaction is
>>>>> > >>>>>>>>>> happening.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Thanks,
>>>>> > >>>>>>>>>> Micah
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu wrote:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> Yufei
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield wrote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>>> > >>>>>>>>>>>> in V4 metadata [2]. More details are in the document but the intent is to
>>>>> > >>>>>>>>>>>> allow engines to store optional metadata associated with these files:
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> 1. The proposed field is optional and cannot be used for
>>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>>> > >>>>>>>>>>>> 2. It also proposes guard-rails for not letting tags cause
>>>>> > >>>>>>>>>>>> metadata bloat.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Thanks,
>>>>> > >>>>>>>>>>>> Micah
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>> > >>>>>>>>>>>> [2]
>>>>> > >>>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
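
For readers following the footer-size example from the Dec 15 message in this thread (an engine records the Parquet footer size in a file's tags/attributes, and a reader falls back to a speculative tail read when the entry is missing), here is a minimal Python sketch of that read path. It is only an illustration, not code from the proposal or from any Iceberg library. The attribute key `footer-size`, the 64 KiB fallback read size, and the `RangeReader` and `fake_parquet` helpers are all assumptions made up for this example. The only format fact it relies on is that a Parquet file ends with a 4-byte little-endian footer length followed by the `PAR1` magic.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_LEN = 8            # 4-byte little-endian footer length + 4-byte magic
FALLBACK_TAIL = 64 * 1024  # assumed speculative read size when no hint exists


class RangeReader:
    """Hypothetical stand-in for an object store client; counts ranged reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def size(self) -> int:
        return len(self.data)

    def read_range(self, offset: int, length: int) -> bytes:
        self.reads += 1
        return self.data[offset:offset + length]


def footer_size_hint(attributes):
    """Return the hinted footer size, or None if absent or unusable.

    A bad or missing hint must never break the read, only cost extra I/O.
    """
    try:
        return int((attributes or {})["footer-size"])  # hypothetical key
    except (KeyError, TypeError, ValueError):
        return None


def read_footer(reader, attributes=None, fallback_tail=FALLBACK_TAIL):
    """Return the raw footer bytes, using the optional size hint if present."""
    file_size = reader.size()
    hint = footer_size_hint(attributes)
    want = hint + TRAILER_LEN if hint is not None else fallback_tail
    want = min(max(want, TRAILER_LEN), file_size)

    tail = reader.read_range(file_size - want, want)
    if len(tail) < TRAILER_LEN or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")

    if needed > len(tail):
        # Hint missing or too small, and the first read fell short: second I/O.
        tail = reader.read_range(file_size - needed, needed)
    return tail[len(tail) - needed:len(tail) - TRAILER_LEN]


def fake_parquet(footer: bytes, body: bytes = b"x" * 1000) -> bytes:
    """Build bytes with a Parquet-style trailer (the footer is not real Thrift)."""
    return (PARQUET_MAGIC + body + footer
            + struct.pack("<I", len(footer)) + PARQUET_MAGIC)
```

With a correct hint, `read_footer(reader, {"footer-size": "300"})` needs one ranged read. With the attribute dropped (for example after a rewrite by another engine) the same call still succeeds, and only pays a second read when the speculative tail turns out to be smaller than the footer.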
