Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Thu, 26 Mar 2026 16:40:07 -0700

Hi Prashant,
Unfortunately, I have conflicts on Wednesdays at that time for the foreseeable
future.  Hopefully between the sync and mailing list we can figure
out a path forward.  If anybody else has feedback please add it to the
Google doc or reply to the thread and I can address it.


Thanks,
Micah

On Thursday, March 26, 2026, Prashant Singh <[email protected]>
wrote:

> Thank you for being flexible, Micah. How about we add this as an agenda
> item for the Iceberg community sync, which is just a day later at 9 pm? A lot
> of folks join, so we would have better participation.
> It also seems like we would have time to talk, since I see the agenda is
> still open. If we can't conclude, we can have a dedicated sync for it.
>
> Best,
> Prashant Singh
>
> On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Thanks Kevin for accepting.  Thanks for your feedback, Prashant. Since you
>> have been active in reviewing, I moved the event to Tuesday at a time that you
>> mentioned you would be available. Hopefully this doesn't exclude anybody
>> else who wants to join the conversation.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]>
>> wrote:
>>
>>> Thanks for bumping this thread, Micah, and thank you for all the work! I
>>> missed this thread completely, apologies for that. I have so far been
>>> responding on the design docs (it would be nice to link the ML to the doc too).
>>>
>>> As for feedback, I am not supportive of this proposal, and I am looking
>>> forward to hearing from other community members on why, despite these severe
>>> cons, we should be doing it, especially given we have a clear, aligned path
>>> for introducing these in a backward-compatible way.
>>>
>>> Here are my reservations:
>>> 1/ While the proposal says one can limit the size (512B by default), it also
>>> says the limit is configurable. This can severely impact the number of entries
>>> we can have in a manifest file. We went through the whole exercise of whether
>>> we should have inline manifest DVs or not, and based on the tradeoffs we chose
>>> one over the other. Allowing this much space per data file, in the worst
>>> case, inside the manifest can severely impact query planning time and query
>>> execution cost (more I/O) for Iceberg readers, which may be different from
>>> whoever produced the Iceberg data set.
>>> 2/ It works on the assumption that we need a spec version bump to add new
>>> fields, which I think is not completely true: we added things like the
>>> partition stats / statistics fields as optional. I don't understand why we
>>> can't do the same, especially with things like schema_id and footer_size
>>> mentioned as motivation. I think the community was pretty aligned on having
>>> schema_id as an optional field to keep writer backward compatibility, with
>>> all new writers taking the benefit of this [1].
>>> 3/ One of the stated motivations is to support vendors' proprietary
>>> metadata for their proprietary clustering algorithms. To me this
>>> looks like a way to work around the spec and let the Iceberg metadata layout
>>> carry information that doesn't mean anything to the Iceberg ecosystem and can
>>> compromise interoperability.
>>> Also think of a case where Vendor A starts producing something in
>>> partnership with Vendor B and, to make things worse, encrypts it so that
>>> Vendor C, who is not in this partnership, cannot see it. IMHO we should not
>>> open up new ways that hurt interop.
>>>
>>> I also want to thank you for proposing the meeting. Unfortunately the
>>> proposed time doesn't work for me, as I have a conflicting meeting. Please
>>> feel free to proceed without me; I can watch the recording later as well.
>>> As far as my support is concerned, I look forward to answers that strongly
>>> support this use case and explain why we should be OK accepting these cons,
>>> given we already had a well-thought-out path forward.
>>>
>>> [1] https://github.com/apache/iceberg/pull/4898
>>>
>>> Best,
>>> Prashant Singh
>>>
>>>
>>>
>>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]> wrote:
>>>
>>>> I added/accepted on the dev calendar. Looking forward to it!
>>>>
>>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> It seems like we might not have full alignment on this proposal, so I
>>>>> tentatively scheduled a sync for next Monday (added to the iceberg dev
>>>>> events calendar).  Please let me know if you are interested in joining and
>>>>> the time doesn't work for you (we can reschedule accordingly).
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>>> > As an update I've made the proposal to add this field to the Single file
>>>>> > commits doc.
>>>>> >
>>>>> > Please let me know if there is any additional feedback.
>>>>> >
>>>>> > Thanks,
>>>>> > Micah
>>>>> >
>>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <
>>>>> [email protected]>
>>>>> > wrote:
>>>>> >
>>>>> > > Thanks Manu, that is the right doc.
>>>>> > >
>>>>> > > As an update, I've incorporated feedback from the community to the
>>>>> > > document:
>>>>> > >
>>>>> > > At a high level the changes are:
>>>>> > > - Renamed the field from "tags" to "attributes"
>>>>> > > - Clarified that limits on attributes should only be enforced for new data.
>>>>> > >   Existing tags must always be carried through.
>>>>> > > - Added more details on enforcing the size of tags.
>>>>> > >
>>>>> > > Are there any objections to folding the proposal into the V4 metadata
>>>>> > > proposal?  Again, the reasons for doing so are mostly around ensuring
>>>>> > > consistent field numbering and making the spec update easier.
>>>>> > >
>>>>> > > If people want further discussion on this I'd be happy to discuss at the
>>>>> > > next V4 metadata sync or create a one-off meeting.  Please let me know.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Micah
>>>>> > >
>>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]>
>>>>> wrote:
>>>>> > >
>>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
>>>>> > >> Commits)?
>>>>> > >> I think you are referring to
>>>>> > >> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>> > >>
>>>>> > >> Best,
>>>>> > >> Manu
>>>>> > >>
>>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <
>>>>> [email protected]>
>>>>> > >> wrote:
>>>>> > >>
>>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>>> > >>> discussion has been happening on the doc [1]) in case it was missed over
>>>>> > >>> the holidays.
>>>>> > >>>
>>>>> > >>> Thanks,
>>>>> > >>> Micah
>>>>> > >>>
>>>>> > >>> [1]
>>>>> > >>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>> > >>>
>>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <
>>>>> [email protected]>
>>>>> > >>> wrote:
>>>>> > >>>
>>>>> > >>>> Sounds good, will wait until next year.
>>>>> > >>>>
>>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>> > >>>>
>>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend
>>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new year?
>>>>> > >>>>>
>>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <
>>>>> [email protected]>
>>>>> > >>>>> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> > I have no problem with adding this discussion to the single file
>>>>> > >>>>>> > work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> > >>>>>> > independent addition to the metadata layout?
>>>>> > >>>>>>
>>>>> > >>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>>>> > >>>>>> consolidate in the doc is that there appears to be a bit of metadata
>>>>> > >>>>>> re-arrangement and new fields.  I wanted to make sure that:
>>>>> > >>>>>>
>>>>> > >>>>>> 1.  We avoid field ID conflicts.
>>>>> > >>>>>> 2.  When writing up the final spec changes it is easy to manage and
>>>>> > >>>>>> does not create a dependency one way or another between the two of these.
>>>>> > >>>>>>
>>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate
>>>>> > >>>>>> piece of work.
>>>>> > >>>>>>
>>>>> > >>>>>> Cheers,
>>>>> > >>>>>> Micah
>>>>> > >>>>>>
>>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>>> > >>>>>> [email protected]> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> I have no problem with adding this discussion to the single file
>>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> > >>>>>>> independent addition to the metadata layout?
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>>>> > >>>>>>> [email protected]> wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> 100% I will also add this summary to the bottom of the requirements
>>>>> > >>>>>>>> section.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>>> > >>>>>>>> thereof), it does not seem like there are strong objections to adding this
>>>>> > >>>>>>>> for V4.  Prashant seemed to maybe have concerns, so I'd like to understand
>>>>> > >>>>>>>> if they are blockers.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the single file
>>>>> > >>>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>> > >>>>>>>> Please let me know if there is a different process.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thanks,
>>>>> > >>>>>>>> Micah
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> [1]
>>>>> > >>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <
>>>>> [email protected]>
>>>>> > >>>>>>>> wrote:
>>>>> > >>>>>>>>
>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>> > >>>>>>>>> optional and never required for correctness or basic functionality. Engines
>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags without breaking reads
>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal behavior (e.g.,
>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Yufei
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>> > >>>>>>>>> [email protected]> wrote:
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>>> Hi Yufei,
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>> > >>>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by another
>>>>> > >>>>>>>>>>> engine remove the tag and break the engine relying on it?
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>>> > >>>>>>>>>> engine, but it could cause suboptimal operations.  For instance, one
>>>>> > >>>>>>>>>> example I brought up in the docs is using tags to cache the Parquet footer
>>>>> > >>>>>>>>>> size, to make sure the footer is fetched in 1 I/O.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> In this case the following would occur.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size
>>>>> > >>>>>>>>>> in tags.
>>>>> > >>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2
>>>>> > >>>>>>>>>> without tags.
>>>>> > >>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer
>>>>> > >>>>>>>>>> length is missing, so it falls back to reading a reasonable number of bytes
>>>>> > >>>>>>>>>> from the end of the Parquet file, hoping the entire footer is retrieved
>>>>> > >>>>>>>>>> (and if it isn't, a second I/O is necessary).
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>>> > >>>>>>>>>> be a sub-optimally clustered table, or perhaps redundant clustering
>>>>> > >>>>>>>>>> operations, but it shouldn't break anything. This is no worse than the case
>>>>> > >>>>>>>>>> today, though, if engine 1 and engine 2 have different clustering algorithms
>>>>> > >>>>>>>>>> and they are being run in interleaved fashion on the same table.  In this
>>>>> > >>>>>>>>>> case it is highly likely that some amount of duplicate compaction is
>>>>> > >>>>>>>>>> happening.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Thanks,
>>>>> > >>>>>>>>>> Micah
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <
>>>>> [email protected]>
>>>>> > >>>>>>>>>> wrote:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>> > >>>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by another
>>>>> > >>>>>>>>>>> engine remove the tag and break the engine relying on it?
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> Yufei
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>> > >>>>>>>>>>> [email protected]> wrote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>>> > >>>>>>>>>>>> in V4 metadata [2].  More details are in the document but the intent is to
>>>>> > >>>>>>>>>>>> allow engines to store optional metadata associated with these files:
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>>> > >>>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags cause
>>>>> > >>>>>>>>>>>> metadata bloat.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Thanks,
>>>>> > >>>>>>>>>>>> Micah
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>> > >>>>>>>>>>>> [2]
>>>>> > >>>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>>
>>>>> >
>>>>>
>>>>
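
For readers following the footer-size example earlier in the thread (Engine 1
falling back to a tail read when the tag is missing), the behavior can be
sketched roughly as below. This is only an illustrative sketch under stated
assumptions, not part of the proposal: CountingFile, read_footer,
footer_size_hint and the 64 KiB default tail are hypothetical names and values.
The only format detail relied on is the Parquet trailer layout (footer bytes,
then a 4-byte little-endian footer length, then the "PAR1" magic).

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


class CountingFile:
    """In-memory stand-in for a remote file; counts tail range reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_tail(self, n: int) -> bytes:
        # Each call models one I/O against object storage.
        self.reads += 1
        n = max(1, min(n, len(self.data)))
        return self.data[-n:]


def read_footer(f, footer_size_hint=None, default_tail=64 * 1024):
    """Return the raw footer bytes.

    footer_size_hint models the optional tag (hypothetical name). It only
    sizes the first read; correctness never depends on it being present
    or accurate.
    """
    if footer_size_hint is not None:
        tail_len = footer_size_hint + TRAILER_LEN
    else:
        tail_len = default_tail  # guess "a reasonable number of bytes"
    tail = f.read_tail(max(tail_len, TRAILER_LEN))

    if tail[-4:] != MAGIC:
        raise ValueError("missing Parquet magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])

    needed = footer_len + TRAILER_LEN
    if needed > len(tail):
        # Guess (or stale hint) was too small: a second I/O is required.
        tail = f.read_tail(needed)
    return tail[-needed:-TRAILER_LEN]
```

With the hint present, a footer larger than the default tail is fetched in one
read. Without it, or with a stale hint, a second read is needed, but the bytes
returned are the same, which is the "suboptimal but never broken" property
discussed above.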
