Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Thu, 26 Mar 2026 15:23:54 -0700

Thanks Kevin for accepting, and thanks for your feedback, Prashant. Since you
have been actively reviewing, I moved the event to Tuesday at a time that you
mentioned you would be available. Hopefully this doesn't exclude anybody
else who wants to join the conversation.


Thanks,
Micah

On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]>
wrote:

> Thanks for bumping this thread, Micah, and thank you for all the work! I
> missed this thread completely, apologies for that. So far I have been
> responding on the design docs (it would be nice to link the ML thread to
> the doc too).
>
> As for feedback, I am not supportive of this proposal. I look forward to
> hearing from other community members on why we should do this despite
> these severe cons, especially given that we have a clear, aligned path for
> introducing these fields in a backward-compatible way.
>
> Here are my reservations:
> 1/ While the proposal says the size can be limited (default 512B), it also
> says the limit is configurable. This can severely reduce the number of
> entries we can have in a manifest file. We went through the whole exercise
> of deciding whether to have inline manifest DVs, and chose one option over
> the other based on the tradeoffs. Allowing this much space per data file in
> the worst case can severely impact query planning time and query execution
> cost (more I/O) for Iceberg readers, which may be different from the engine
> that produced the Iceberg data set.
> 2/ It works on the assumption that we need a spec version bump to add new
> fields, which I think is not completely true: we added things like the
> partition stats / statistics fields as optional. I don't understand why we
> can't do the same, especially for things like schema_id and footer_size,
> which are mentioned as motivation. I think the community was pretty aligned
> on adding schema_id as an optional field so that writers stay backward
> compatible while all new writers take advantage of it [1].
> 3/ One of the stated motivations is to support vendors' proprietary
> metadata for their proprietary clustering algorithms. To me this looks
> like a way to work around the spec and let the Iceberg metadata layout
> carry information that doesn't mean anything to the Iceberg ecosystem and
> can compromise interoperability.
> Also think of a case where Vendor A starts producing something in
> partnership with Vendor B and, to make things worse, encrypts it so that
> Vendor C, who is not in this partnership, cannot see it. IMHO we should not
> open up new ways that hurt interop.
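
The size concern in point 1/ can be made concrete with a rough calculation. This is only a back-of-the-envelope sketch: the 8 MiB manifest target and the 1 KiB baseline entry size are assumptions chosen for illustration, the function name is hypothetical, and compression of the manifest file is ignored. Only the 512B figure comes from the thread.

```python
def entries_per_manifest(target_bytes: int, base_entry_bytes: int,
                         attribute_bytes: int = 0) -> int:
    """Rough number of data-file entries that fit in one manifest.

    Ignores file-level overhead and compression, so treat the result
    as an order-of-magnitude estimate only.
    """
    return target_bytes // (base_entry_bytes + attribute_bytes)


TARGET_BYTES = 8 * 1024 * 1024   # assumed manifest target size (8 MiB)
BASE_ENTRY_BYTES = 1024          # hypothetical entry size without attributes
ATTRIBUTE_BYTES = 512            # worst case at the default limit from the thread

without_attributes = entries_per_manifest(TARGET_BYTES, BASE_ENTRY_BYTES)
with_attributes = entries_per_manifest(TARGET_BYTES, BASE_ENTRY_BYTES,
                                       ATTRIBUTE_BYTES)

print(without_attributes, with_attributes)
```

Under these assumptions a fully used 512B budget cuts the entries per manifest by about a third. The real impact depends on actual entry sizes and on how well the attribute bytes compress.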
>
> I also want to thank you for proposing the meeting. Unfortunately the
> proposed time doesn't work for me, as I have a conflicting meeting. Please
> feel free to proceed without me; I can watch the recording later.
> As far as my support is concerned, I look forward to answers that strongly
> support this use case and explain why we should be OK accepting these cons
> given that we already had a well-thought-out path forward.
>
> [1] https://github.com/apache/iceberg/pull/4898
>
> Best,
> Prashant Singh
>
>
>
> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]> wrote:
>
>> I added/accepted on the dev calendar. Looking forward to it!
>>
>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> It seems like we might not have full alignment on this proposal, so I
>>> tentatively scheduled a sync for next Monday (added to the Iceberg dev
>>> events calendar).  Please let me know if you are interested in joining and
>>> the time doesn't work for you (we can reschedule accordingly).
>>>
>>> Thanks,
>>> Micah
>>>
>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>> > As an update I've made the proposal to add this field to the Single
>>> > file commits doc.
>>> >
>>> > Please let me know if there is any additional feedback.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <[email protected]
>>> >
>>> > wrote:
>>> >
>>> > > Thanks Manu, that is the right doc.
>>> > >
>>> > > As an update, I've incorporated feedback from the community to the
>>> > > document:
>>> > >
>>> > > At a high level the changes are:
>>> > > - Renamed the field from "tags" to "attributes".
>>> > > - Clarified that limits on attributes should only be enforced for new
>>> > > data. Existing tags must always be carried through.
>>> > > - Added more details on enforcing the size of tags.
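
The "enforce for new data, carry through existing" rule from the list above could be sketched as follows. This is a minimal illustration, not the proposal's actual mechanism: the function names are hypothetical, and measuring size as the sum of UTF-8 key bytes plus value bytes is an assumption. The 512-byte default is the figure quoted elsewhere in this thread.

```python
from typing import Dict

DEFAULT_LIMIT_BYTES = 512  # default quoted in the thread; described as configurable


def attributes_size(attrs: Dict[str, bytes]) -> int:
    """Assumed size measure: UTF-8 key bytes plus raw value bytes."""
    return sum(len(key.encode("utf-8")) + len(value)
               for key, value in attrs.items())


def attributes_for_entry(attrs: Dict[str, bytes], is_new_data: bool,
                         limit: int = DEFAULT_LIMIT_BYTES) -> Dict[str, bytes]:
    """Apply the size limit only to newly written data files.

    Attributes on existing entries are returned unchanged even when they
    exceed the limit, so a rewrite of metadata never drops them.
    """
    if is_new_data and attributes_size(attrs) > limit:
        raise ValueError(
            f"attributes use {attributes_size(attrs)} bytes, limit is {limit}")
    return dict(attrs)
```

With this shape, lowering the configured limit later only affects future writes; entries written under a larger limit remain valid.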
>>> > >
>>> > > Are there any objections to folding the proposal into the V4 metadata
>>> > > proposal?  Again, the reasons for doing so are mostly around ensuring
>>> > > consistent field numbering and making the spec update easier.
>>> > >
>>> > > If people want further discussion on this I'd be happy to discuss at the
>>> > > next V4 metadata sync or create a one-off meeting.  Please let me know.
>>> > >
>>> > > Thanks,
>>> > > Micah
>>> > >
>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]>
>>> wrote:
>>> > >
>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
>>> > >> Commits)?
>>> > >> I think you are referring to
>>> > >>
>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>> > >>
>>> > >> Best,
>>> > >> Manu
>>> > >>
>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <
>>> [email protected]>
>>> > >> wrote:
>>> > >>
>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>> > >>> discussion has been happening on the doc [1]) in case it was missed
>>> > >>> over the holidays.
>>> > >>>
>>> > >>> Thanks,
>>> > >>> Micah
>>> > >>>
>>> > >>> [1]
>>> > >>>
>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>> > >>>
>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <
>>> [email protected]>
>>> > >>> wrote:
>>> > >>>
>>> > >>>> Sounds good, will wait until next year.
>>> > >>>>
>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]>
>>> wrote:
>>> > >>>>
>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we extend
>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new year?
>>> > >>>>>
>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <
>>> [email protected]>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> > I have no problem with adding this discussion to the single file
>>> > >>>>>> > work, but I'm not sure that would speed it up? Seems like this is a
>>> > >>>>>> > pretty independent addition to the metadata layout?
>>> > >>>>>>
>>> > >>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>> > >>>>>> consolidate in the doc is that there appears to be a bit of metadata
>>> > >>>>>> re-arrangement and new fields.  I wanted to make sure that:
>>> > >>>>>>
>>> > >>>>>> 1.  We avoid field ID conflicts.
>>> > >>>>>> 2.  When writing up the final spec changes, they are easy to manage and
>>> > >>>>>> do not create a dependency one way or another between the two of these.
>>> > >>>>>>
>>> > >>>>>> Happy to keep the implementation of the guard-rails as a separate
>>> > >>>>>> piece of work.
>>> > >>>>>>
>>> > >>>>>> Cheers,
>>> > >>>>>> Micah
>>> > >>>>>>
>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>> > >>>>>> [email protected]> wrote:
>>> > >>>>>>
>>> > >>>>>>> I have no problem with adding this discussion to the single file
>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like this is a
>>> > >>>>>>> pretty independent addition to the metadata layout?
>>> > >>>>>>>
>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>> > >>>>>>> [email protected]> wrote:
>>> > >>>>>>>
>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be
>>> > >>>>>>>>> strictly optional and never required for correctness or basic
>>> > >>>>>>>>> functionality. Engines should always be able to safely drop or
>>> > >>>>>>>>> ignore tags without breaking reads or writes, with the only possible
>>> > >>>>>>>>> impact being suboptimal behavior (e.g., extra I/O), as you described.
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> 100%. I will also add this summary to the bottom of the requirements
>>> > >>>>>>>> section.
>>> > >>>>>>>>
>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>> > >>>>>>>> thereof), it does not seem like there are strong objections to adding
>>> > >>>>>>>> this for V4.  Prashant seemed to maybe have concerns, so I'd like to
>>> > >>>>>>>> understand if they are blockers.
>>> > >>>>>>>>
>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the single
>>> > >>>>>>>> file improvement work, which has already reorganized the metadata
>>> > >>>>>>>> schema [1].  Please let me know if there is a different process.
>>> > >>>>>>>>
>>> > >>>>>>>> Thanks,
>>> > >>>>>>>> Micah
>>> > >>>>>>>>
>>> > >>>>>>>> [1]
>>> > >>>>>>>>
>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>> > >>>>>>>>
>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <
>>> [email protected]>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags must be
>>> > >>>>>>>>> strictly optional and never required for correctness or basic
>>> > >>>>>>>>> functionality. Engines should always be able to safely drop or
>>> > >>>>>>>>> ignore tags without breaking reads or writes, with the only possible
>>> > >>>>>>>>> impact being suboptimal behavior (e.g., extra I/O), as you described.
>>> > >>>>>>>>>
>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>> > >>>>>>>>> trade-off feels reasonable to me.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Yufei
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>> > >>>>>>>>> [email protected]> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> Hi Yufei,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>> > >>>>>>>>>>> a clustering algorithm), would a data file rewrite (compaction) by
>>> > >>>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>> > >>>>>>>>>> engine, but it could cause suboptimal operations.  For instance, one
>>> > >>>>>>>>>> example I brought up in the doc is using tags to cache the Parquet
>>> > >>>>>>>>>> footer size, to make sure the footer is fetched in 1 I/O.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> In this case the following would occur.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size
>>> > >>>>>>>>>> in tags.
>>> > >>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2
>>> > >>>>>>>>>> without tags.
>>> > >>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer
>>> > >>>>>>>>>> length is missing, so it falls back to reading a reasonable number of
>>> > >>>>>>>>>> bytes from the end of the Parquet file, hoping the entire footer is
>>> > >>>>>>>>>> retrieved (and if it isn't, a second I/O is necessary).
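
The fallback in step 3 can be sketched roughly as below. This is a simplified illustration rather than any engine's actual reader. It relies on the Parquet layout in which a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name, the `read_range(offset, length)` callback, the meaning of the hint (footer length excluding the 8-byte trailer), and the 64 KiB speculative read size are all assumptions made for the example.

```python
import struct
from typing import Callable, Optional, Tuple

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer(read_range: Callable[[int, int], bytes], file_size: int,
                footer_size_hint: Optional[int] = None,
                speculative_bytes: int = 64 * 1024) -> Tuple[bytes, int]:
    """Return (footer_bytes, number_of_reads).

    With a hint (for example, cached in a tag) the first read fetches
    exactly footer + trailer. Without one, a speculative tail read is
    issued and a second read is needed only if the guess was too small.
    """
    if footer_size_hint is not None:
        want = footer_size_hint + TRAILER_LEN
    else:
        want = speculative_bytes
    want = min(want, file_size)

    tail = read_range(file_size - want, want)
    reads = 1
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file: bad trailing magic")

    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")
    if needed > len(tail):
        # Guess was too small (or the hint was stale): pay a second I/O.
        tail = read_range(file_size - needed, needed)
        reads += 1
    return tail[-needed:-TRAILER_LEN], reads
```

Either path returns the same footer bytes; the hint only changes how many reads it takes, which matches the point that a missing tag costs performance but not correctness.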
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>> > >>>>>>>>>> be a sub-optimally clustered table, or perhaps redundant clustering
>>> > >>>>>>>>>> operations, but it shouldn't break anything. This is no worse than the
>>> > >>>>>>>>>> case today, though, if engine 1 and engine 2 have different clustering
>>> > >>>>>>>>>> algorithms and they are being run in interleaved fashion on the same
>>> > >>>>>>>>>> table.  In this case it is highly likely that some amount of duplicate
>>> > >>>>>>>>>> compaction is happening.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>> > >>>>>>>>>> functioning should never be put in tags.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Thanks,
>>> > >>>>>>>>>> Micah
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <
>>> [email protected]>
>>> > >>>>>>>>>> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> Thanks for the proposal!
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>> > >>>>>>>>>>> a clustering algorithm), would a data file rewrite (compaction) by
>>> > >>>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Yufei
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>> > >>>>>>>>>>> [email protected]> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>> > >>>>>>>>>>>> in V4 metadata [2].  More details are in the document, but the intent
>>> > >>>>>>>>>>>> is to allow engines to store optional metadata associated with these
>>> > >>>>>>>>>>>> files:
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>> > >>>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags cause
>>> > >>>>>>>>>>>> metadata bloat.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Thanks,
>>> > >>>>>>>>>>>> Micah
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>> > >>>>>>>>>>>> [2]
>>> > >>>>>>>>>>>>
>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>>
>>> >
>>>
>>
