Re: [DISCUSS] Adding Tags field to Iceberg V4

Prashant Singh Thu, 26 Mar 2026 15:39:16 -0700

Thank you for being flexible, Micah. How about we add this as an agenda
item for the Iceberg community sync, which is just a day later at 9 pm? A lot
of folks join, so we would have better participation. It also seems like we
would have time to talk, since I see the agenda is still open. If we can't
conclude there, we can have a dedicated sync for it.


Best,
Prashant Singh

On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]>
wrote:

> Thanks Kevin for accepting.  Thanks for your feedback, Prashant. Since you
> have been actively reviewing, I moved the event to Tuesday at a time that you
> mentioned you would be available. Hopefully this doesn't exclude anybody
> else who wants to join the conversation.
>
> Thanks,
> Micah
>
> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]>
> wrote:
>
>> Thanks for bumping this thread, Micah, and thank you for all the work! I
>> missed this thread completely, apologies for that. I have so far been
>> responding on the design docs (it would be nice to link the ML thread to
>> the doc too).
>>
>> As for feedback, I am not supportive of this proposal, and I am looking
>> forward to hearing from other community members on why we should do it
>> despite these severe cons, especially given we have a clear, aligned path
>> for introducing these fields in a backward-compatible way.
>>
>> Here are my reservations:
>> 1/ While the proposal says one can limit the size (512B by default), it
>> also says the limit is configurable. This can severely impact the number of
>> entries we can have in a manifest file. We went through the whole exercise
>> of whether we should have inline manifest DVs or not, and based on the
>> tradeoffs we chose one over the other. Allowing this much space per data
>> file in the worst case inside the manifest can severely impact query
>> planning time and query execution cost (more IO) for Iceberg readers,
>> which may be different from whoever produced the Iceberg data set.
>> 2/ It works on the assumption that we need a spec version bump to add new
>> fields, which I think is not completely true: we added things like the
>> partition stats / statistics fields as optional. I don't understand why we
>> can't do the same, especially with things like schema_id and footer_size
>> mentioned as motivation. I think the community was pretty aligned on having
>> schema_id as an optional field to keep writer backward compatibility, with
>> all new writers taking the benefit of this [1].
>> 3/ One of the stated motivations is to support vendors' proprietary
>> metadata for their proprietary clustering algorithms. To me this looks
>> like a way to work around the spec and let the Iceberg metadata layout
>> carry information that doesn't mean anything to the Iceberg ecosystem and
>> can compromise interoperability.
>> Also think of a case where Vendor A starts producing something in
>> partnership with Vendor B and, to make things worse, encrypts it so that
>> Vendor C, which is not in this partnership, cannot see it. IMHO we should
>> not open up new ways to hurt interop.
>>
>> I also want to thank you for proposing the meeting. Unfortunately the
>> proposed time doesn't work for me, as I have a conflicting meeting. Please
>> feel free to proceed without me; I can watch the recording later as well.
>> As far as my support is concerned, I look forward to answers that strongly
>> support this use case and explain why we should be OK accepting these cons,
>> given we already had a well-thought-out path to move forward.
>>
>> [1] https://github.com/apache/iceberg/pull/4898
>>
>> Best,
>> Prashant Singh
>>
>>
>>
>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]> wrote:
>>
>>> I added/accepted on the dev calendar. Looking forward to it!
>>>
>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> It seems like we might not have full alignment on this proposal, so I
>>>> tentatively scheduled a sync for next Monday (added to the Iceberg dev
>>>> events calendar).  Please let me know if you are interested in joining and
>>>> the time doesn't work for you (we can reschedule accordingly).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>> > As an update I've made the proposal to add this field to the Single
>>>> > file commits doc.
>>>> >
>>>> > Please let me know if there is any additional feedback.
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <
>>>> [email protected]>
>>>> > wrote:
>>>> >
>>>> > > Thanks Manu, that is the right doc.
>>>> > >
>>>> > > As an update, I've incorporated feedback from the community into the
>>>> > > document:
>>>> > >
>>>> > > At a high level the changes are:
>>>> > > - Renamed the field from "tags" to "attributes"
>>>> > > - Clarified that limits on attributes should only be enforced for new
>>>> > > data. Existing tags must always be carried through.
>>>> > > - Added more details on enforcing the size of tags.
>>>> > >
>>>> > > Are there any objections to folding the proposal into the V4 metadata
>>>> > > proposal?  Again, the reasons for doing so are mostly around ensuring
>>>> > > consistent field numbering and making the spec update easier.
>>>> > >
>>>> > > If people want further discussion on this I'd be happy to discuss at the
>>>> > > next V4 metadata sync or create a one-off meeting.  Please let me know.
>>>> > >
>>>> > > Thanks,
>>>> > > Micah
>>>> > >
>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]>
>>>> wrote:
>>>> > >
>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
>>>> > >> Commits)?
>>>> > >> I think you are referring to
>>>> > >>
>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>> > >>
>>>> > >> Best,
>>>> > >> Manu
>>>> > >>
>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <
>>>> [email protected]>
>>>> > >> wrote:
>>>> > >>
>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>> > >>> discussion has been happening on the doc [1]) in case it was missed over
>>>> > >>> the holidays.
>>>> > >>>
>>>> > >>> Thanks,
>>>> > >>> Micah
>>>> > >>>
>>>> > >>> [1]
>>>> > >>>
>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>> > >>>
>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <
>>>> [email protected]>
>>>> > >>> wrote:
>>>> > >>>
>>>> > >>>> Sounds good, will wait until next year.
>>>> > >>>>
>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]>
>>>> wrote:
>>>> > >>>>
>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend
>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new year?
>>>> > >>>>>
>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <
>>>> [email protected]>
>>>> > >>>>> wrote:
>>>> > >>>>>
>>>> > >>>>>> > I have no problem with adding this discussion to the single file
>>>> > >>>>>> > work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> > >>>>>> > independent addition to the metadata layout?
>>>> > >>>>>>
>>>> > >>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>>> > >>>>>> consolidate in the doc is that there appears to be a bit of metadata
>>>> > >>>>>> re-arrangement and new fields.  I wanted to make sure that:
>>>> > >>>>>>
>>>> > >>>>>> 1.  We avoid field ID conflicts.
>>>> > >>>>>> 2.  When writing up the final spec changes it is easy to manage and
>>>> > >>>>>> does not create a dependency one way or another between the two of these.
>>>> > >>>>>>
>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate
>>>> > >>>>>> piece of work.
>>>> > >>>>>>
>>>> > >>>>>> Cheers,
>>>> > >>>>>> Micah
>>>> > >>>>>>
>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>> > >>>>>> [email protected]> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> I have no problem with adding this discussion to the single file
>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> > >>>>>>> independent addition to the metadata layout?
>>>> > >>>>>>>
>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>>> > >>>>>>> [email protected]> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>> > >>>>>>>>> extra I/O), as you described.
>>>> > >>>>>>>>
>>>> > >>>>>>>>
>>>> > >>>>>>>> 100%. I will also add this summary to the bottom of the requirements
>>>> > >>>>>>>> section.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>> > >>>>>>>> thereof), it does not seem like there are strong objections to adding this
>>>> > >>>>>>>> for V4.  Prashant seemed to maybe have concerns, so I'd like to understand
>>>> > >>>>>>>> if they are blockers.
>>>> > >>>>>>>>
>>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the single file
>>>> > >>>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>> > >>>>>>>> Please let me know if there is a different process.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Thanks,
>>>> > >>>>>>>> Micah
>>>> > >>>>>>>>
>>>> > >>>>>>>> [1]
>>>> > >>>>>>>>
>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <
>>>> [email protected]>
>>>> > >>>>>>>> wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>> > >>>>>>>>> extra I/O), as you described.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Yufei
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>> > >>>>>>>>> [email protected]> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>> Hi Yufei,
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>> > >>>>>>>>>> engine, but it could cause suboptimal operations.  For instance, one
>>>> > >>>>>>>>>> example I brought up in the docs is using tags to cache the Parquet
>>>> > >>>>>>>>>> footer size, to make sure the footer is fetched in 1 I/O.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> In this case the following would occur.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size
>>>> > >>>>>>>>>> in tags.
>>>> > >>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2
>>>> > >>>>>>>>>> without tags.
>>>> > >>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer
>>>> > >>>>>>>>>> length is missing, so it falls back to reading a reasonable number of
>>>> > >>>>>>>>>> bytes from the end of the Parquet file, hoping the entire footer is
>>>> > >>>>>>>>>> retrieved (and if it isn't, a second I/O is necessary).
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>> > >>>>>>>>>> be a sub-optimally clustered table, or perhaps redundant clustering
>>>> > >>>>>>>>>> operations, but it shouldn't break anything. This is no worse than the case
>>>> > >>>>>>>>>> today, though, if engine 1 and engine 2 have different clustering algorithms
>>>> > >>>>>>>>>> and they are being run in interleaved fashion on the same table.  In this
>>>> > >>>>>>>>>> case it is highly likely that some amount of duplicate compaction is
>>>> > >>>>>>>>>> happening.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Thanks,
>>>> > >>>>>>>>>> Micah
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <
>>>> [email protected]>
>>>> > >>>>>>>>>> wrote:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> Yufei
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>> > >>>>>>>>>>> [email protected]> wrote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>> > >>>>>>>>>>>> in V4 metadata [2].  More details are in the document but the intent is to
>>>> > >>>>>>>>>>>> allow engines to store optional metadata associated with these files:
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>> > >>>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags cause
>>>> > >>>>>>>>>>>> metadata bloat.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> Thanks,
>>>> > >>>>>>>>>>>> Micah
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>> > >>>>>>>>>>>> [2]
>>>> > >>>>>>>>>>>>
>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>>
>>>> >
>>>>
>>>
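
For anyone following the footer-size example in Micah's Dec 15 reply quoted
above, here is a minimal sketch of the fallback he describes: with a footer
size hint the reader can fetch the footer in one read, and without it the
reader guesses a tail size and may need a second read. This is only an
illustration and not code from the proposal. The function name, the
speculative_bytes default, and the in-memory "file" are all hypothetical. The
only format fact relied on is that a Parquet file ends with a 4-byte
little-endian footer length followed by the "PAR1" magic.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer(data, footer_size=None, speculative_bytes=64 * 1024):
    """Return (footer_bytes, io_count) for a Parquet-style file held in `data`.

    `footer_size` stands in for a value cached in a tag/attribute (hypothetical);
    when it is missing we read a guessed number of bytes from the end instead.
    """
    io_count = 0

    def read_tail(n):
        # Each call models one ranged read against object storage.
        nonlocal io_count
        io_count += 1
        return data[max(0, len(data) - n):]

    want = footer_size + TAIL_LEN if footer_size is not None else speculative_bytes
    tail = read_tail(want)
    if tail[-4:] != MAGIC:
        raise ValueError("missing Parquet magic at end of file")
    actual = struct.unpack("<I", tail[-8:-4])[0]
    if actual + TAIL_LEN > len(tail):
        # Hint was missing or stale and the guess was too small: second read.
        tail = read_tail(actual + TAIL_LEN)
    return tail[-(actual + TAIL_LEN):-TAIL_LEN], io_count


def fake_file(footer_len):
    # Toy stand-in for a Parquet file: magic, data pages, footer, length, magic.
    footer = b"F" * footer_len
    return MAGIC + b"d" * 1000 + footer + struct.pack("<I", footer_len) + MAGIC


if __name__ == "__main__":
    f = fake_file(100)
    print(read_footer(f, footer_size=100)[1])       # hint present
    print(read_footer(f, speculative_bytes=16)[1])  # hint dropped, guess too small
```

Either way the footer is read correctly; dropping the hint only changes how
many reads it takes, which matches the "suboptimal but not broken" behavior
described in the thread.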

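A rough back-of-the-envelope sketch of reservation 1/ in Prashant's Mar 26
message quoted above: if a manifest has a fixed target size, every byte of
per-file attributes reduces how many entries fit in it. The numbers here are
assumptions for illustration only: an 8 MiB manifest target size, a
hypothetical 512 B per entry before attributes, and no compression (real
manifests are compressed, so actual ratios will differ).

```python
def entries_per_manifest(target_bytes, base_entry_bytes, attr_bytes):
    """Upper bound on entries that fit in one manifest of `target_bytes`."""
    return target_bytes // (base_entry_bytes + attr_bytes)


TARGET = 8 * 1024 * 1024  # assumed manifest target size (8 MiB)
BASE_ENTRY = 512          # hypothetical bytes per entry without attributes

if __name__ == "__main__":
    for attr in (0, 128, 512, 4096):
        n = entries_per_manifest(TARGET, BASE_ENTRY, attr)
        print(f"attributes={attr:>4} B -> about {n} entries per manifest")
```

Under these assumed numbers, filling the full 512 B of attributes halves the
entries per manifest (16384 down to 8192), and a larger configured limit
shrinks it further.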