Re: [DISCUSS] Adding Tags field to Iceberg V4

Manu Zhang Mon, 05 Jan 2026 17:48:38 -0800

Happy new year, Micah. Are you linking the wrong doc (Iceberg Single File
Commits)?
I think you are referring to
https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz


Best,
Manu

On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <[email protected]>
wrote:

> Happy new year everyone, I just wanted to bump this thread (most
> discussion has been happening on the doc [1]) in case it was missed over
> the holidays.
>
> Thanks,
> Micah
>
> [1]
> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>
> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Sounds good, will wait until next year.
>>
>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>>
>>> Micah, many people will be OOO in the next two weeks. Can we extend the
>>> feedback deadline to at least 1-2 weeks after the new year?
>>>
>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> > I have no problem with adding this discussion to the single file
>>>> work, but I'm not sure that would speed it up? Seems like this is a pretty
>>>> independent addition to the metadata layout?
>>>>
>>>> Yes, it is fairly independent.  The main reason I wanted to consolidate
>>>> in the doc is that there appears to be a bit of metadata re-arrangement
>>>> and new fields.  I wanted to make sure that:
>>>>
>>>> 1.  We avoid field ID conflicts.
>>>> 2.  The final spec changes are easy to manage when written up and do not
>>>> create a dependency one way or another between the two of these.
>>>>
>>>> Happy to keep the implementation of the guard-rails as a separate piece
>>>> of work.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> I have no problem with adding this discussion to the single file work,
>>>>> but I'm not sure that would speed it up? Seems like this is a pretty
>>>>> independent addition to the metadata layout?
>>>>>
>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>>> breaking reads or writes, with the only possible impact being
>>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>>
>>>>>>
>>>>>> 100%. I will also add this summary to the bottom of the requirements
>>>>>> section.
>>>>>>
>>>>>> Based on mailing list discussion and doc comments (or lack thereof),
>>>>>> it does not seem like there are strong objections to adding this for V4.
>>>>>> Prashant seemed to maybe have concerns, so I'd like to understand if they
>>>>>> are blockers.
>>>>>>
>>>>>> If there isn't additional feedback by the end of next week, I'd like
>>>>>> to assume a lazy consensus and consolidate this with the single file
>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>>> Please let me know if there is a different process.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>>
>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>>> breaking reads or writes, with the only possible impact being
>>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>>>
>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>>>> trade-off feels reasonable to me.
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Yufei,
>>>>>>>>
>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by
>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>>>>>>>
>>>>>>>>
>>>>>>>> The intent here is that dropping tags should never break an
>>>>>>>> engine.  But it could cause suboptimal operations.  For instance, one
>>>>>>>> example I brought up in the doc is using tags to cache the Parquet
>>>>>>>> footer size, to make sure the footer is fetched in 1 I/O.
>>>>>>>>
>>>>>>>> In this case the following would occur.
>>>>>>>>
>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size in
>>>>>>>> tags.
>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2 without
>>>>>>>> tags.
>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer length
>>>>>>>> is missing, so it falls back to reading a reasonable number of bytes
>>>>>>>> from the end of the Parquet file, hoping the entire footer is
>>>>>>>> retrieved (and if it isn't, a second I/O is necessary).
>>>>>>>>
>>>>>>>> Similarly for clustering algorithms, I think the result could be
>>>>>>>> a sub-optimally clustered table, or perhaps redundant clustering
>>>>>>>> operations, but it shouldn't break anything. This is no worse than the
>>>>>>>> case today, though, if engine 1 and engine 2 have different clustering
>>>>>>>> algorithms and they are being run in interleaved fashion on the same
>>>>>>>> table.  In this case it is highly likely that some amount of duplicate
>>>>>>>> compaction is happening.
>>>>>>>>
>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>>>>> functioning should never be put in tags.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the proposal!
>>>>>>>>>
>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by
>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Iceberg Dev,
>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files in
>>>>>>>>>> V4 metadata [2].  More details are in the document but the intent is
>>>>>>>>>> to allow engines to store optional metadata associated with these
>>>>>>>>>> files:
>>>>>>>>>>
>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>>>>>>>>> metadata required for reading the table correctly.
>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags cause
>>>>>>>>>> metadata bloat.
>>>>>>>>>>
>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>>>>> [2]
>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>>>>>
>>>>>>>>>>
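
The footer-size scenario from Micah's Dec 15 reply can be sketched roughly as follows. This is only an illustration, not the proposal's API: the tag key `example.parquet-footer-size`, the 64 KiB fallback read, and the `CountingFile` helper are all assumptions made up for this sketch. The only format fact relied on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. Note that the tag is treated purely as a hint: the length stored in the file itself is always what decides how many bytes form the footer, so a missing or stale tag can cost an extra read but cannot change the result.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic
FOOTER_SIZE_TAG = "example.parquet-footer-size"  # hypothetical tag key
DEFAULT_TAIL_GUESS = 64 * 1024  # hypothetical fallback read size


class CountingFile:
    """In-memory stand-in for a remote file that counts tail reads (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_tail(self, n: int) -> bytes:
        """Return the last n bytes (or the whole file if it is shorter)."""
        self.reads += 1
        n = min(n, len(self.data))
        return self.data[len(self.data) - n:]


def read_footer(f, tags=None, tail_guess=DEFAULT_TAIL_GUESS) -> bytes:
    """Return the raw footer bytes, using the tag only as an optional hint."""
    hint = None
    if tags:
        try:
            hint = int(tags.get(FOOTER_SIZE_TAG))
        except (TypeError, ValueError):
            hint = None  # missing or malformed tag: behave as if it were absent

    if hint is not None and hint >= 0:
        want = hint + TRAILER_LEN  # tag present: aim for exactly one I/O
    else:
        want = tail_guess  # tag absent: read a "reasonable number of bytes"

    tail = f.read_tail(want)
    if len(tail) < TRAILER_LEN or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")

    # The length recorded in the file is authoritative, not the tag.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if len(tail) < needed:
        # Guess (or stale hint) was too small: a second I/O is necessary.
        tail = f.read_tail(needed)
        if len(tail) < needed:
            raise ValueError("file is shorter than its declared footer")
    return tail[-needed:-TRAILER_LEN]
```

With a correct tag the footer comes back in one read; with no tag and a fallback guess that is too small, the same bytes come back after two reads.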
