Re: [DISCUSS] First-class Tag concept in Iceberg REST Catalog

Yufei Gu Fri, 22 May 2026 15:42:57 -0700

Thanks EJ!

For me, an important criterion for introducing a new entity into IRC is
whether it improves interoperability. Other than tables and views, UDFs and
indexes are good examples where standardization clearly increases
interoperability. My concern with tags or labels is that they do not
necessarily provide the same benefit. One of the major use cases is sharing
tags or labels across catalogs, but the semantics of a tag can vary
significantly across systems. For example, in a governance setup, a tag
like: {"cost_center": "finance"} may participate in billing attribution,
governance reporting, automatic policy attachment, or inheritance across
schemas and derived objects. In another catalog, the exact same tag may
only be treated as lightweight user metadata for search or filtering,
without any operational meaning. Even the value "finance" could mean
something slightly different in each system.


In short, even though the tag structure is identical, the behavioral
semantics are not interoperable. Using them may therefore create
inconsistent expectations and outcomes across systems.
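
To make the concern concrete, here is a minimal sketch. Both "catalogs" and
their function names are hypothetical and do not correspond to any real
system or to anything in the IRC spec; they only show the same tag payload
producing different outcomes.

```python
# Hypothetical illustration only: neither function models a real catalog.
TAG = {"cost_center": "finance"}


def governed_catalog_effects(tags):
    """Assumed catalog A: the tag drives operational behavior."""
    effects = []
    if "cost_center" in tags:
        # Billing attribution and automatic policy attachment.
        effects.append("bill:" + tags["cost_center"])
        effects.append("attach-policy:" + tags["cost_center"])
    return effects


def metadata_only_catalog_effects(tags):
    """Assumed catalog B: the tag is searchable metadata with no side effects."""
    return []


print(governed_catalog_effects(TAG))
print(metadata_only_catalog_effects(TAG))
```

The payload is byte-for-byte identical in both calls; only the catalog-side
interpretation differs, which is exactly what a shared structure alone
cannot standardize.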

That said, I could still be convinced if there are strong arguments for the
value of introducing Tags as part of IRC. If we ever decide to introduce
tags to IRC, a Tag deserves to remain a first-class concept and to be
modeled as a separate entity.

Yufei


On Wed, May 20, 2026 at 2:28 PM EJ Wang <[email protected]>
wrote:

> Thanks Andrei. Coordinating the two efforts before either hardens is the
> right call, and I want to do that.
>
> To your direct question: I'd want Tag to stay a separate first-class REST
> concept (TagDefinition with namespace/name, allowed values, inheritability,
> CRUD lifecycle), not built on the labels track as a spec-level dependency.
> The point of standardizing Tag is shared classification semantics across
> systems, and that needs more than allowed_values. The proposal will be
> explicit about normative interpretation (visibility, atomicity, attachment
> value type) so a tag's meaning doesn't drift across catalogs.
>
> Whether a catalog persists or indexes tag assignments using the same
> machinery as labels (reserved-namespace pattern, dedicated reverse-lookup
> index, or something else) reads as a catalog implementation choice to me.
> I'd rather the spec leave that open than commit to one shape, and the REST
> contract should expose tag assignments through tag-specific semantics, not
> through reserved-namespace labels.
>
> Governed/Standard maps to your catalog-managed vs client-writable axis.
> I'll align terminology where the surfaces overlap.
>
> One scope point I want to keep clean: this proposal is the classification
> input side. Policy enforcement stays separate, on the read-restrictions
> track (#13879).
>
> A sync before I write the full design doc would help. Paired proposals
> with a coordinated boundary is the shape I'd target. Slack works, or any
> morning in your timezone; send a few options.
>
> -ej
>
> On Tue, May 12, 2026 at 1:38 AM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> Hi EJ,
>>
>> Thanks for starting this thread. I think there is overlap with my labels
>> work in flight, and it would be good to converge the two efforts rather
>> than end up with two parallel attachment mechanisms.
>>
>> Current labels work:
>>
>> - Proposal (#15521):
>>
>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>> - OpenAPI PR (#15750): generic Labels primitive, flat k/v on tables, views,
>>   namespaces, and columns by field-id.
>> - CRUD follow-up:
>>
>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?tab=t.emq0gkbmc7bx#heading=h.ijaa62gyvv30
>> -- UpdateLabels, ETags, atomicity, catalog-managed vs client-writable keys
>> and SQL DDL surface.
>>
>> My read is that labels cover most of the attachment side of your proposal:
>> tables, columns, views, namespaces, and the use cases you listed. Where
>> your proposal adds something new is the management model: first-class Tag
>> definitions with namespace-scoped identity, allowed values, inheritance
>> rules, and reverse lookup.
>>
>> That part is worth designing.
>>
>> The cleanest framing I see is:
>>
>> 1. Labels = attachment mechanism. Generic k/v on tables, columns, views,
>> and namespaces.
>> 2. Tags = governance model on top. Named Tag definitions with allowed
>> values and management endpoints. One possible implementation is to store
>> tag attachments as labels under a reserved namespace, with reverse lookup
>> as a dedicated endpoint indexed over labels.
>>
>> Two things worth aligning early: your Governed/Standard split looks close
>> to the same axis as catalog-managed vs client-writable in the CRUD
>> follow-up, so we should probably reconcile the terminology. Allowed values
>> on Tag definitions also seem like the structural answer to the interop
>> concern Yufei raised on both threads.
>>
>> Would you be open to building the Tag entity work on top of the labels
>> track? That could be a section in the existing docs, a paired proposal, or
>> whatever shape works best.
>>
>> Happy to chat on Slack or set up a quick sync before you invest in the
>> full design doc.
>>
>> Thanks,
>> Andrei
>>
>> On Tue, May 12, 2026 at 1:27 AM Yufei Gu <[email protected]> wrote:
>>
>>> Hi EJ,
>>>
>>> Thanks for sharing this.
>>>
>>> Tagging is useful as a lightweight way to categorize objects. It can
>>> help with common cases like basic classification, ownership or cost
>>> attribution, and simple discovery or filtering. That said, even with this
>>> lightweight framing, I’m still a bit concerned about how different catalog
>>> implementations and engines will interpret tags and whether we can make
>>> them truly interoperable. In practice, small differences in semantics or
>>> expectations could lead to fragmentation across catalogs.
>>>
>>> I would also be cautious about layering in governance or policy related
>>> semantics too early, as that may further increase the risk of inconsistent
>>> interpretations.
>>>
>>> Yufei
>>>
>>>
>>> On Mon, May 4, 2026 at 3:55 PM EJ Wang <[email protected]>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I'm new to the Iceberg community, currently contributing to Polaris OSS
>>>> on the tagging design. Before going deeper into a design doc, I want to
>>>> surface the direction on this list and invite early input from people with
>>>> more context on how IRC-level concepts get shaped here.
>>>>
>>>> Polaris users are asking for a classification primitive that covers
>>>> compliance (PII, sensitivity, data domain), ownership and cost attribution,
>>>> and AI or semantic hints on columns. My read is that we will build this
>>>> regardless, but designing it inside Polaris alone reduces its value.
>>>> Governance tools would need per-catalog adapters. If the shape is
>>>> standardized at the IRC level, the ecosystem benefits far more broadly.
>>>>
>>>> Across catalogs and governance platforms, the tag concept has
>>>> independently converged on a similar shape: a first-class Tag entity with
>>>> identity (name + namespace), optional schema (allowed values,
>>>> inheritability), and attachments to objects carrying a value. Snowflake
>>>> tags, Unity Catalog governed tags, Google Cloud Dataplex tag templates,
>>>> Apache Atlas classifications, Apache Gravitino tags, and DataHub tags all
>>>> expose this pattern, across ownership, FinOps, AI reasoning, and governance
>>>> use cases. When independent products converge, my read is that the shape is
>>>> the natural decomposition rather than a vendor-specific artifact.
>>>>
>>>> Two adjacent efforts are already in flight. The read-restrictions
>>>> proposal (apache/iceberg#13879
>>>> <https://github.com/apache/iceberg/issues/13879>) delivers enforcement
>>>> to engines. A Tag proposal would complement it as the classification input
>>>> side, so catalogs can resolve tag-driven enforcement internally and deliver
>>>> the outcome via read-restrictions. The labels proposal (
>>>> apache/iceberg#15521 <https://github.com/apache/iceberg/issues/15521>)
>>>> serves generic catalog-managed metadata. My read is that a first-class
>>>> Tag with identity and lifecycle is distinct from labels; they solve
>>>> different problems and can coexist.
>>>>
>>>> At a high level, I think the minimum valuable scope in the IRC spec is:
>>>> a Tag entity with CRUD at the namespace level, tag attachments with target
>>>> and value applied to tables, columns via field-id, views, and namespaces, a
>>>> reverse lookup endpoint for "find objects with tag X", tag attachment
>>>> retrieval via a dedicated endpoint, and a small set of normative clauses on
>>>> privilege enforcement, visibility filtering, and rename atomicity.
>>>> Resolved tags do not need to live in LoadTableResult.
>>>>
>>>> Things I'd like to keep out of the core spec as layered extensions, not
>>>> first pass: typed multi-field per-attachment values (Atlas, Dataplex;
>>>> addable non-breaking later), a Governed-vs-Standard type distinction (Unity
>>>> Catalog's pattern can be expressed through configuration), and
>>>> tag-to-policy binding (belongs in a separate Policy authoring phase).
>>>>
>>>> What I'm asking: early feedback on whether this direction fits the IRC
>>>> roadmap, pointers to prior discussions I may have missed, and interest in
>>>> co-championing from contributors outside Polaris. I'll follow up with a
>>>> full design doc in the coming week. An issue placeholder is at
>>>> apache/iceberg#16165 <https://github.com/apache/iceberg/issues/16165> for
>>>> tracking.
>>>>
>>>> -ej
>>>>
>>>
