Thank you for being flexible, Micah. How about we add this as an agenda item for the Iceberg community sync, which is just a day later at 9 pm? A lot of folks join, so we would have better participation. It also seems like we would have time to talk, since I see the agenda is still open. If we can't conclude there, we can have a dedicated sync for it.

Best,
Prashant Singh

On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]> wrote:

> Thanks Kevin for accepting. Thanks for your feedback Prashant, since you
> have been active reviewing, I moved the event to Tuesday at a time that you
> mentioned you would be available, hopefully this doesn't exclude anybody
> else who wants to join the conversation.
>
> Thanks,
> Micah
>
> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]>
> wrote:
>
>> Thanks for bumping this thread Micah and thank you for all the work ! I
>> missed this thread completely, apologies for that, I have so far been
>> responding to the design docs (would be nice to link ML to doc too).
>>
>> For the feedback, I am not supportive of this proposal and I am looking
>> forward to hear from other community members on despite these severe con
>> why we should be doing it specially given we have clear aligned path on
>> how to introduce these by in backward compatible way
>>
>> Here are my reservations :
>> 1/ while the proposal says one can limit the default size 512B, it says
>> it is configurable, this can severely impact the number of entries we can
>> have in a manifest file, we went through the whole exercise of whether we
>> should have inline manifest dv or not, and based on tradeoff we concluded
>> one over the other. Giving this much of size in the worst case per data
>> file inside the manifest can severely impact the query planning time and
>> query execution cost (will more IO) of the iceberg readers which may be
>> different than who produced the iceberg data set.
>> 2/ It works on an assumption we need to do spec version bump to add new
>> fields, which i think is not completely true we added things like partition
>> stats / statistic field as optional, i don't understand why cant we do the
>> same, specially with things like schema_id and footer_size mentioned as
>> motivation.
>> I think the community
>> was pretty aligned to have schema_id as optional field to have writer
>> backward compatibility as all new writers taking the benefit of this [1]
>> 3/ one of motivations thats is stated is to support Vendors proprietary
>> metadata for supporting their proprietary clustering algorithm, this to me
>> looks like a way to work around spec to let iceberg metadata layout carry
>> these info which doesn't means anything to iceberg ecosystem and can
>> compromise interoperability.
>> Also think of a case where Vendor A starts producing something
>> partnering with Vendor B and to make things worse encrypt it and not let
>> vendor C not in this partnership see it. IMHO we should not open up new
>> ways that hurt the interop.
>>
>> I also want to thank you for proposing the meeting, unfortunately the
>> proposed time doesn't work for me, i have a conflicting meeting, please
>> feel free to proceed without me, I can watch the recording later as well,
>> as far as my support is concerned I look forward to answers that strongly
>> supporting this use case and why should we be ok accepting these cons given
>> we already had a well thought path to move forward.
>>
>> [1] https://github.com/apache/iceberg/pull/4898
>>
>> Best,
>> Prashant Singh
>>
>>
>>
>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]> wrote:
>>
>>> I added/accepted on the dev calendar. Looking forward to it!
>>>
>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> It seems like we might not have full alignment on this proposal, I
>>>> tentatively scheduled a sync for next Monday (added to the iceberg dev
>>>> events calendar). Please let me know if you are interested in joining and
>>>> the time doesn't work for you (we can reschedule accordingly).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>> > As an update I've made the proposal to add this field to the Single file
>>>> > commits doc.
>>>> >
>>>> > Please let me know if there is any additional feedback.
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <[email protected]>
>>>> > wrote:
>>>> >
>>>> > > Thanks Manu, that is the right doc.
>>>> > >
>>>> > > As an update, I've incorporated feedback from the community to the
>>>> > > document:
>>>> > >
>>>> > > At a high level the changes are:
>>>> > > - Renamed the field from "tags" to "attributes"
>>>> > > - Clarified limits on attributes should only be enforced for new data.
>>>> > > Existing tags must always be carried through.
>>>> > > - Added more details on enforcing size of tags.
>>>> > >
>>>> > > Are there any objections to folding the proposal into the V4 metadata
>>>> > > proposal? Again, the reasons for doing so are mostly around ensuring
>>>> > > consistent field numbering and making the spec update easier.
>>>> > >
>>>> > > If people want further discussion on this I'd be happy to discuss at the
>>>> > > next V4 metadata sync or create a one-off meeting. Please let me know.
>>>> > >
>>>> > > Thanks,
>>>> > > Micah
>>>> > >
>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]> wrote:
>>>> > >
>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
>>>> > >> Commits) ?
>>>> > >> I think you are referring to
>>>> > >> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>> > >>
>>>> > >> Best,
>>>> > >> Manu
>>>> > >>
>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <[email protected]>
>>>> > >> wrote:
>>>> > >>
>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>> > >>> discussion has been happening on the doc [1]) in case it was missed over
>>>> > >>> the holidays.
>>>> > >>>
>>>> > >>> Thanks,
>>>> > >>> Micah
>>>> > >>>
>>>> > >>> [1]
>>>> > >>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>> > >>>
>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]>
>>>> > >>> wrote:
>>>> > >>>
>>>> > >>>> Sounds good, will wait until next year.
>>>> > >>>>
>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>>>> > >>>>
>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend
>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new year?
>>>> > >>>>>
>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]>
>>>> > >>>>> wrote:
>>>> > >>>>>
>>>> > >>>>>> > I have no problem with adding this discussion to the single file
>>>> > >>>>>> > work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> > >>>>>> > independent addition to the metadata layout?
>>>> > >>>>>>
>>>> > >>>>>> Yes, it is fairly independent. The main reason I wanted to
>>>> > >>>>>> consolidate in the doc, it appears there is a bit of metadata
>>>> > >>>>>> re-arrangement and new fields. I wanted to make sure that:
>>>> > >>>>>>
>>>> > >>>>>> 1. We avoid field ID conflicts.
>>>> > >>>>>> 2.
>>>> > >>>>>>    When writing up the final spec changes it is easy to manage and
>>>> > >>>>>>    not create a dependency one way or another between the two of these.
>>>> > >>>>>>
>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate
>>>> > >>>>>> piece of work.
>>>> > >>>>>>
>>>> > >>>>>> Cheers,
>>>> > >>>>>> Micah
>>>> > >>>>>>
>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <[email protected]>
>>>> > >>>>>> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> I have no problem with adding this discussion to the single file
>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> > >>>>>>> independent addition to the metadata layout?
>>>> > >>>>>>>
>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]>
>>>> > >>>>>>> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>> > >>>>>>>>> extra I/O), as you described.
>>>> > >>>>>>>>
>>>> > >>>>>>>>
>>>> > >>>>>>>> 100% I will also add this summary to the bottom of the requirements
>>>> > >>>>>>>> section.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>> > >>>>>>>> thereof), it does not seem like there are strong objections to adding this
>>>> > >>>>>>>> for V4. Prashant seemed to maybe have concerns, so I'd like to understand
>>>> > >>>>>>>> if they are blockers.
>>>> > >>>>>>>>
>>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the single file
>>>> > >>>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>> > >>>>>>>> Please let me know if there is a different process.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thanks,
>>>> > >>>>>>>> Micah
>>>> > >>>>>>>>
>>>> > >>>>>>>> [1]
>>>> > >>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]>
>>>> > >>>>>>>> wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>> > >>>>>>>>> extra I/O), as you described.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Yufei
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <[email protected]>
>>>> > >>>>>>>>> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>> Hi Yufei,
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons(like
>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite(compaction) by another
>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>> > >>>>>>>>>> engine. But it could cause suboptimal operations. For instance, one
>>>> > >>>>>>>>>> example I brought in the docs is using tags to cache parquet footer size,
>>>> > >>>>>>>>>> to make sure it is fetched in 1 I/O.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> In this case the following would occur.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size
>>>> > >>>>>>>>>>    in tags.
>>>> > >>>>>>>>>> 2. Engine 2 does a rewrite/compactions and produces File 2
>>>> > >>>>>>>>>>    without tags.
>>>> > >>>>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer
>>>> > >>>>>>>>>>    length is missing so it falls back reading a reasonable number of bytes
>>>> > >>>>>>>>>>    from the end of the parquet file, hoping the entire footer is retrieved
>>>> > >>>>>>>>>>    (and if it isn't a second I/O is necessary).
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>> > >>>>>>>>>> yield a sub-optimally clustered table, or perhaps redundant clustering
>>>> > >>>>>>>>>> operations but shouldn't break anything. This is no worse then the case
>>>> > >>>>>>>>>> today though if engine 1 and engine 2 have different clustering algorithms
>>>> > >>>>>>>>>> and they are being run in interleaved fashion on the same table. In this
>>>> > >>>>>>>>>> case it is highly likely that some amount of duplicate compaction is
>>>> > >>>>>>>>>> happening.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Thanks,
>>>> > >>>>>>>>>> Micah
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>> > >>>>>>>>>> wrote:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons(like
>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite(compaction) by another
>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it.
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> Yufei
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]>
>>>> > >>>>>>>>>>> wrote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>> > >>>>>>>>>>>> in V4 metadata [2]. More details are in the document but the intent is to
>>>> > >>>>>>>>>>>> allow engines to store optional metadata associated with these files:
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> 1. The proposed field is optional and cannot be used for
>>>> > >>>>>>>>>>>>    metadata required for reading the table correctly.
>>>> > >>>>>>>>>>>> 2. It also proposes guard-rails for not letting tags cause
>>>> > >>>>>>>>>>>>    metadata bloat.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> Thanks,
>>>> > >>>>>>>>>>>> Micah
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>> > >>>>>>>>>>>> [2]
>>>> > >>>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>>
>>>> > >>>>
>>>
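
For anyone skimming the thread, the footer-size fallback described in the Dec 15 reply quoted above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the proposal's API: the "footer_size" key, the attributes dict, CountingFile, make_file, and DEFAULT_GUESS are all hypothetical names, and the file is an in-memory stand-in. The only format fact relied on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1.

```python
import struct

MAGIC = b"PAR1"      # a Parquet file ends with <footer><4-byte LE footer length><PAR1>
TRAILER_LEN = 8      # 4-byte footer length + 4-byte magic
DEFAULT_GUESS = 64   # hypothetical speculative tail read, kept tiny for the demo


class CountingFile:
    """In-memory stand-in for a remote file that counts range reads (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_tail(self, n: int) -> bytes:
        """Return the last n bytes (or the whole file if it is shorter)."""
        self.reads += 1
        return self.data if n >= len(self.data) else self.data[-n:]


def make_file(footer: bytes, body: bytes = b"x" * 1000) -> bytes:
    """Build fake file bytes with a Parquet-style trailer (not a real Parquet file)."""
    return body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(f: CountingFile, attributes=None) -> bytes:
    """Read the footer, using the optional size hint only to save an I/O."""
    hint = attributes.get("footer_size") if attributes else None
    # With a hint, request exactly footer + trailer; without one, guess.
    want = hint + TRAILER_LEN if hint is not None else DEFAULT_GUESS
    tail = f.read_tail(want)
    if tail[-4:] != MAGIC:
        raise ValueError("missing trailing magic bytes")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > len(tail):
        # Hint missing or stale and the first read was too short: second I/O.
        tail = f.read_tail(needed)
    return tail[-needed:-TRAILER_LEN]


if __name__ == "__main__":
    footer = b"F" * 200
    hinted = CountingFile(make_file(footer))
    read_footer(hinted, {"footer_size": 200})
    unhinted = CountingFile(make_file(footer))
    read_footer(unhinted, None)
    print("reads with hint:", hinted.reads)
    print("reads without hint:", unhinted.reads)
```

The point of the sketch is that the result is the same with or without the hint, and even with a stale hint, since the length stored in the file trailer is always what decides how many bytes are returned. Only the number of reads changes, which matches the "optional, never required for correctness" principle discussed in the thread.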
