Re: [DISCUSS] Table and Column Label Metadata in Iceberg REST Catalog

EJ Wang Thu, 11 Jun 2026 18:57:54 -0700

Thanks Andrei for the recap. I want to clarify one point on the
labels-vs-Tag boundary, mostly to avoid having the labels discussion
pre-decide questions that belong in the parallel Tag discussion.


I agree with the use cases motivating labels: exposing catalog-managed
context such as ownership, domain, cost attribution, classification hints,
and semantic hints in REST responses can be useful for engines and clients.
The part I am less sure about is whether those use cases require a separate
Labels concept in the spec, or whether they should be modeled as projected
metadata from a structured Tag/classification model.

My concern is dependency direction. If we introduce labels as a flat
generic primitive first, and later add structure for identity, lifecycle,
allowed values, inheritance, field-id attachment, visibility, and reverse
lookup, then we may end up reconstructing a Tag model around labels. That
feels less clear than defining the structured model directly and allowing
catalogs to project the relevant assignments into REST responses where
useful.

In other words, I don't think the interesting question is only whether
labels should be flat or structured. I think the question is whether labels
should be a separate primitive at all, or whether the read-response use
cases can be covered by a projected view of structured tag/classification
assignments.

Where I'd be especially careful is the phrase that tags are
"catalog-internal structured concepts." I agree that the full Tag
discussion is outside the scope of Labels V1, but I would not want Labels
V1 to pre-decide that structured tagging/classification semantics are only
catalog-internal and not an IRC concept. That is exactly the separate
question being explored in the Tag thread.

The factoring I'd prefer to evaluate is:

   - Tag (structured) classification: authoring, lifecycle, identity,
   field-id attachment, inheritance, visibility, and lookup semantics
   - REST response projection: optional metadata returned to clients,
   potentially derived from structured tag assignments
   - read-restrictions: enforcement result delivered to engines

That framing may reduce the need for a separate Labels primitive while
still preserving the read-response use cases that motivated the labels
proposal.
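
To make the projection idea concrete, here is a rough sketch of what I have
in mind, not a proposed API. Every name in it (TagAssignment, the visible
flag, project_for_response) is hypothetical, and the "labels" /
"column-labels" output keys are only my assumption about the split shape
discussed for the read response. The point is the dependency direction:
structured assignments are the source of truth, and the flat response
metadata is derived from them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagAssignment:
    """Hypothetical structured assignment of one tag value to a table or column."""
    tag_key: str                    # illustrative key, e.g. "domain"
    value: str
    field_id: Optional[int] = None  # None means the assignment is table-level
    visible: bool = True            # hypothetical visibility flag


def project_for_response(assignments):
    """Derive flat, read-only response metadata from structured assignments."""
    labels = {}
    by_field = {}
    for a in assignments:
        if not a.visible:
            continue  # hidden assignments never reach the response
        if a.field_id is None:
            labels[a.tag_key] = a.value
        else:
            by_field.setdefault(a.field_id, {})[a.tag_key] = a.value
    # Assumed output shape: flat k/v for the table, array keyed by field-id for columns.
    column_labels = [
        {"field-id": fid, "labels": kv} for fid, kv in sorted(by_field.items())
    ]
    return {"labels": labels, "column-labels": column_labels}


assignments = [
    TagAssignment("domain", "payments"),
    TagAssignment("sensitivity", "pii", field_id=3),
    TagAssignment("internal-review", "pending", visible=False),
]
projected = project_for_response(assignments)
```

In this sketch the client still sees only flat key-value pairs, while
identity, visibility, and field-id attachment stay with the structured
model rather than being added to labels later.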

I realize this may be a bigger factoring question than Labels V1 intended
to answer, but I think it is worth making explicit before the two threads
diverge. If the community wants one logical concept rather than both labels
and tags, I think we should at least evaluate the direction where the
structured Tag/classification model is the source of truth and lightweight
REST response metadata is a projection from it, before standardizing labels
as an independent primitive.

-ej

On Thu, Jun 11, 2026 at 11:53 AM Andrei Tserakhau via dev <
[email protected]> wrote:

> Hi all,
>
> Recap from the dedicated labels sync held on May 28, 2026
> (recording [1]).
>
> Summary of the discussion:
>
>    - Strong consensus to land the read API first, with the write API
>    as a separate follow-up proposal (Ryan, Sung, Kevin, Christian
>    aligned). Christian raised a concern that the write half could
>    lag behind (Trino views precedent); to address this, the
>    proposal will document the write-path direction alongside the
>    read API.
>    - Labels remain flat key-value pairs, no internal structure.
>    Kubernetes labels precedent invoked — flat shape, conventions
>    via well-known prefixes, no spec-defined vocabulary.
>    Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris)
>    discussed and set aside in favor of prefix conventions.
>    - Labels-vs-Tag boundary: labels are the wire-protocol mechanism
>    for cross-catalog metadata exchange (this proposal); tags are
>    catalog-internal structured concepts (Snowflake, UC, Polaris
>    each have their own shape). Standardizing Tag itself as a
>    first-class spec entity is a separate effort, not in scope for
>    V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that
>    direction.
>    - Governance scope: Prashant Singh raised concerns about
>    positioning labels as a governance protocol — provenance,
>    identity mapping across IDPs, inheritance semantics. Room
>    aligned that labels are broader than governance — semantic
>    metadata exchange is the load-bearing case; governance remains
>    a valid use case among many, and whether to use labels for
>    governance is a catalog-level decision rather than a spec
>    mandate. Policy decisions and enforcement live in read
>    restrictions (PR #13879 [3]) — a parallel and complementary
>    track.
>    - Write API shape converging on an independent CRUD endpoint
>    (UpdateLabels-style verb) with a transactional path for atomic
>    table+label operations at create/alter time. Two-class
>    distinction (catalog-authored vs externally-managed labels)
>    reaffirmed; Ryan noted not all labels should be editable via
>    CRUD since many are produced by the catalog through inheritance,
>    classification, or automated paths.
>    - Bulk APIs surfaced as a real need for both read (inverted index
>    — finding tables/columns matching given labels) and write
>    (applying labels at scale, classifier batch operations). Scoped
>    for inclusion in the write API proposal.
>    - Pattern for adding new first-class REST concepts (labels, UDFs,
>    indexes, etc.): independent CRUD endpoint per concept, paired
>    with a transactional path for atomic operations alongside table
>    create/alter. Useful reference shape for future spec additions.
>
> Post-sync follow-ups already in motion:
>
>    - Hot-path discipline added to the proposal in response to
>    Christian Thiel's doc comment: LoadTableResponse latency MUST
>    NOT increase due to labels; how catalogs meet this is
>    implementation-defined (caching, freshness trade-offs,
>    filtering). Capability negotiation — parallel to the work on
>    PR #13879 [3] — is a future direction.
>    - Use case split (high-confidence cross-catalog: semantic, domain,
>    classification, sensitivity vs platform-specific: owner,
>    principals, anything identity-bound) agreed after offline
>    follow-up with Prashant; will be reflected in the next proposal
>    revision.
>    - A separate [DISCUSS] thread will land the substrate framing
>    publicly.
>
> Next sync is approximately three to four weeks out. Tentative agenda:
> labels/Tag boundary update, write-path sketch walk-through, path
> to VOTE on the read API.
>
> Thanks to everyone who joined and to those continuing to engage
> on the design doc [4] and spec PR #15750 [5].
>
> Best,
> Andrei
>
> [1] https://youtu.be/P4NOQASNtPA
> [2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
> [3] https://github.com/apache/iceberg/pull/13879
> [4]
> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
> [5] https://github.com/apache/iceberg/pull/15750
>
> On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <
> [email protected]> wrote:
>
>> Quick update on this one: we'll cover it at the Dedicated Sync this
>> Thursday (10-11am US / 7-8pm CET). Thanks to Daniel Weeks for getting it
>> on the calendar.
>>
>> The last time labels were on the sync was 2026-04-15. Plenty of productive
>> offline discussion since then, mostly in the gdoc comment threads. Thanks
>> to everyone who engaged:
>>
>>    - *Daniel Weeks* — for the IRC-spec-vs-table-spec framing that now
>>    anchors the Alternatives section
>>    - *Fokko Driesprong* — for challenging motivation on the cost-based
>>    defense and driving the ownership reframe
>>    - *Yufei Gu* — for the structure debate that landed us on the split
>>    shape
>>    - *Sung Yun* — for the early consumption-pattern and addressing
>>    questions
>>    - *Maninder Parmar* — for the properties-relationship probing
>>    - *Christian Thiel* — for pushing on the write API direction
>>
>> Concrete changes in-doc since April:
>>
>>    - Problem Statement reframed around catalog-owned metainformation as
>>    the load-bearing concept.
>>    - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
>>    boundary instead of cost arguments.
>>    - Structure debate closed on a split shape: labels (flat k/v at the
>>    table level, k8s-style) + column-labels (array with field-id). Labels
>>    type itself is flat — no internal structure. Same shape applies on
>>    LoadViewResponse and namespaces.
>>    - CRUD companion as a second tab in the same gdoc — UpdateLabels REST
>>    verb, two-class distinction for catalog-managed vs externally-managed
>>    keys, optimistic concurrency with ETags.
>>    - Working Trino prototype at
>>    https://github.com/laskoviymishka/irc-labels/pull/1 — native ALTER
>>    TABLE ... SET LABEL DDL translating end-to-end.
>>
>> Parallel work to flag: EJ Wang's first-class Tag concept
>> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
>> proposal on dev@. We've agreed to coordinate as paired proposals — Tag
>> as a separate first-class REST concept, labels as the lower-level
>> attachment substrate. Both efforts share the cross-cutting interop question.
>>
>> Goal on Thursday is to walk through the current state, confirm the
>> split-shape lands cleanly, and identify what's needed to move toward a VOTE
>> on the read API. Anyone reading along is welcome to join.
>>
>> Doc (current state):
>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>
>> Thanks,
>> Andrei
>>
>> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <
>> [email protected]> wrote:
>>
>>> Thanks Ryan!
>>>
>>> Your point about avoiding first-class metadata requirements is exactly
>>> the design principle here. Labels let each catalog surface what it knows
>>> without the spec dictating what catalogs must track.
>>>
>>> To build on this, I put together a POC showing the approach works across
>>> the ecosystem.
>>>
>>> Key design principles that held up in practice:
>>>
>>> - No new requirements on catalogs. Labels are optional in the response.
>>> A catalog that doesn't serve labels returns the same response as today.
>>>
>>> - Catalog-scoped, not table state. Every catalog we tried already has
>>> internal metadata separate from Iceberg properties — Polaris has
>>> internalProperties, UC has uc_properties, Lakekeeper has namespace
>>> properties in PostgreSQL. Labels just give this existing metadata a
>>> standard way through the protocol.
>>>
>>> - No property overriding. Labels are explicitly separate from table
>>> properties. Properties configure behavior, labels describe context. Engines
>>> know which is which.
>>>
>>> What was built:
>>>
>>> - Spec change: https://github.com/apache/iceberg/pull/15750
>>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>>
>>> Catalog implementations:
>>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>>> internalProperties)
>>> - Unity Catalog OSS:
>>> https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>>> uc_properties)
>>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676
>>> (labels from namespace properties)
>>>
>>> Full demo: https://github.com/laskoviymishka/irc-labels
>>>
>>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>>> pattern is the same everywhere: each catalog already has internal metadata
>>> that doesn't belong in table properties. Labels give it a standard way out
>>> through the protocol.
>>>
>>> The Polaris implementation also addresses
>>> https://github.com/apache/polaris/issues/3222 - the community has been
>>> asking for a way to surface business metadata alongside table loads. Labels
>>> solve this without adding any requirements beyond an optional field.
>>>
>>> Beyond ownership and classification, the demo also shows labels enabling
>>> AI agent table selection (agents reason about tables using semantic labels
>>> instead of guessing from column names) and governance via trusted engine
>>> (ClickHouse reading sensitivity labels to auto-generate masking policies).
>>>
>>> Happy to discuss the spec design or any of the implementation details.
>>>
>>> Andrei
>>>
>>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> I think that this is a reasonable way to solve some persistent issues
>>>> that we've seen.
>>>>
>>>> Many catalogs track additional metadata that is not part of the table
>>>> spec (or others) like "owner", and right now there is no way to exchange or
>>>> share that information. I'm also hesitant to start including it as
>>>> first-class metadata because that puts additional requirements on catalogs
>>>> that may not align. For instance, Tabular had no concept of a table "owner"
>>>> and instead used default grants at the schema level. I like that this
>>>> solution allows catalogs to provide information in a generic way that
>>>> doesn't add requirements in the REST spec. And it is an alternative to
>>>> overriding table properties with catalog-managed information, which I think
>>>> is an anti-pattern.
>>>>
>>>> Thanks, Andrei! I think this is a good idea.
>>>>
>>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> `LoadTableResponse` returns table metadata — schema, snapshots, file
>>>>> locations — but catalogs have operational context about tables that has no
>>>>> standard place to go: cost attribution, ownership, governance hints,
>>>>> semantic metadata. Right now catalogs have two options:
>>>>>
>>>>> 1. Properties — durable, commit-versioned table state. Good for
>>>>> persistent metadata; wrong for ephemeral catalog context.
>>>>> 2. Custom fields — catalog-specific extensions with no
>>>>> interoperability. Each catalog invents its own structure; engines have no
>>>>> basis to read them.
>>>>>
>>>>> The community has already identified this gap. Polaris opened an issue
>>>>> [1] requesting a standard extension point in the IRC protocol for
>>>>> catalog-managed metadata. Two earlier threads [2][3] explored column-level
>>>>> metadata, though in the context of table format changes.
>>>>>
>>>>> We propose adding an optional `labels` field to `LoadTableResponse`
>>>>> for catalog-managed metadata. Labels are string key-value pairs generated
>>>>> per-request from the catalog's internal systems; nothing is written to
>>>>> table files. Engines may use or ignore them entirely. Labels give catalog
>>>>> providers a standard channel to surface context to any client without
>>>>> bilateral custom integrations for every catalog-engine pair.
>>>>>
>>>>> Details:
>>>>> - GitHub Issue: apache/iceberg#15521
>>>>> - Design Document: [4]
>>>>>
>>>>> Please review the proposal and share your feedback.
>>>>>
>>>>> Thanks,
>>>>> Andrei
>>>>>
>>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>>> [4]:
>>>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>>>>
>>>>
