Hi all,

Recap from the dedicated labels sync held on May 28, 2026 (recording [1]).

Summary of the discussion:

- Strong consensus to land the read API first, with the write API as a separate follow-up proposal (Ryan, Sung, Kevin, Christian aligned). Christian raised a concern that the write half could lag behind (Trino views precedent); to address this, the proposal will document the write-path direction alongside the read API.
- Labels remain flat key-value pairs with no internal structure. Kubernetes labels precedent invoked: flat shape, conventions via well-known prefixes, no spec-defined vocabulary. Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris) was discussed and set aside in favor of prefix conventions.
- Labels-vs-Tag boundary: labels are the wire-protocol mechanism for cross-catalog metadata exchange (this proposal); tags are catalog-internal structured concepts (Snowflake, UC, Polaris each have their own shape). Standardizing Tag itself as a first-class spec entity is a separate effort, not in scope for V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that direction.
- Governance scope: Prashant Singh raised concerns about positioning labels as a governance protocol: provenance, identity mapping across IDPs, inheritance semantics. The room aligned that labels are broader than governance: semantic metadata exchange is the load-bearing case; governance remains a valid use case among many, and whether to use labels for governance is a catalog-level decision rather than a spec mandate. Policy decisions and enforcement live in read restrictions (PR #13879 [3]), a parallel and complementary track.
- Write API shape is converging on an independent CRUD endpoint (UpdateLabels-style verb) with a transactional path for atomic table+label operations at create/alter time. Two-class distinction (catalog-authored vs externally-managed labels) reaffirmed; Ryan noted not all labels should be editable via CRUD since many are produced by the catalog through inheritance, classification, or automated paths.
- Bulk APIs surfaced as a real need for both read (inverted index: finding tables/columns matching given labels) and write (applying labels at scale, classifier batch operations). Scoped for inclusion in the write API proposal.
- Pattern for adding new first-class REST concepts (labels, UDFs, indexes, etc.): independent CRUD endpoint per concept, paired with a transactional path for atomic operations alongside table create/alter. Useful reference shape for future spec additions.

Post-sync follow-ups already in motion:

- Hot-path discipline added to the proposal in response to Christian Thiel's doc comment: LoadTableResponse latency MUST NOT increase due to labels; how catalogs meet this is implementation-defined (caching, freshness trade-offs, filtering). Capability negotiation, parallel to the work on PR #13879 [3], is a future direction.
- Use case split (high-confidence cross-catalog: semantic, domain, classification, sensitivity; versus platform-specific: owner, principals, anything identity-bound) agreed after offline follow-up with Prashant; will be reflected in the next proposal revision.
- A separate [DISCUSS] thread will land the substrate framing publicly.

Next sync is approximately three to four weeks out. Tentative agenda: labels/Tag boundary update, write-path sketch walk-through, path to VOTE on the read API.

Thanks to everyone who joined and to those continuing to engage on the design doc [4] and spec PR #15750 [5].

Best,
Andrei

[1] https://youtu.be/P4NOQASNtPA
[2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
[3] https://github.com/apache/iceberg/pull/13879
[4] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
[5] https://github.com/apache/iceberg/pull/15750

On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <[email protected]> wrote:

> Quick update on this one:
> we'll cover this on the Dedicated Sync this Thursday (10-11am US / 7-8pm
> CET). Thanks to Daniel Weeks for getting it on the calendar.
>
> Last time labels was on the sync was 2026-04-15. Plenty of productive
> offline discussion since then, mostly in the gdoc comment threads. Thanks
> to everyone who engaged:
>
> - *Daniel Weeks*: for the IRC-spec-vs-table-spec framing that now
> anchors the Alternatives section
> - *Fokko Driesprong*: for challenging motivation on the cost-based
> defense and driving the ownership reframe
> - *Yufei Gu*: for the structure debate that landed us on the split
> shape
> - *Sung Yun*: for the early consumption-pattern and addressing
> questions
> - *Maninder Parmar*: for the properties-relationship probing
> - *Christian Thiel*: for pushing on the write API direction
>
> Concrete changes in-doc since April:
>
> - Problem Statement reframed around catalog-owned metainformation as
> the load-bearing concept.
> - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
> boundary instead of cost arguments.
> - Structure debate closed on a split shape: labels (flat k/v at the
> table level, k8s-style) + column-labels (array with field-id). Labels
> type itself is flat, with no internal structure. Same shape applies on
> LoadViewResponse and namespaces.
> - CRUD companion as a second tab in the same gdoc: UpdateLabels REST
> verb, two-class distinction for catalog-managed vs externally-managed keys,
> optimistic concurrency with ETags.
> - Working Trino prototype at
> https://github.com/laskoviymishka/irc-labels/pull/1 (native ALTER
> TABLE ... SET LABEL DDL translating end-to-end).
>
> Parallel work to flag: EJ Wang's first-class Tag concept
> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
> proposal on dev@. We've agreed to coordinate as paired proposals: Tag as
> a separate first-class REST concept, labels as the lower-level attachment
> substrate. Both efforts share the cross-cutting interop question.
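
As a rough illustration of the split shape described above (flat table-level labels plus a column-labels array keyed by field-id), a client could index the column entries as sketched here. The JSON field names `labels`, `column-labels`, and `field-id`, the `example.org/` key prefix, and all sample values are assumptions made for illustration only; the design doc and spec PR define the actual wire format.

```python
from typing import Any


def column_labels_by_field_id(response: dict[str, Any]) -> dict[int, dict[str, str]]:
    """Index the (assumed) `column-labels` array by field-id.

    Returns an empty dict when the catalog serves no column labels.
    """
    index: dict[int, dict[str, str]] = {}
    for entry in response.get("column-labels") or []:
        if not isinstance(entry, dict):
            continue  # skip malformed entries rather than failing the load
        field_id = entry.get("field-id")
        labels = entry.get("labels") or {}
        if isinstance(field_id, int) and isinstance(labels, dict):
            index.setdefault(field_id, {}).update(labels)
    return index


def with_prefix(labels: dict[str, str], prefix: str) -> dict[str, str]:
    """Select labels by key prefix, in the spirit of k8s-style conventions."""
    return {k: v for k, v in labels.items() if k.startswith(prefix)}


# Hypothetical response fragment; keys and values are made up.
response = {
    "labels": {"example.org/domain": "sales", "owner": "data-platform"},
    "column-labels": [
        {"field-id": 3, "labels": {"example.org/sensitivity": "pii"}},
    ],
}

print(with_prefix(response["labels"], "example.org/"))
print(column_labels_by_field_id(response))
```

Keying column labels by field-id rather than by column name means a lookup keeps working after a column rename, which is presumably why the array carries the id.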
>
> Goal on Thursday is to walk through the current state, confirm the
> split-shape lands cleanly, and identify what's needed to move toward a VOTE
> on the read API. Anyone reading along is welcome to join.
>
> Doc (current state):
> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>
> Thanks,
> Andrei
>
> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <
> [email protected]> wrote:
>
>> Thanks Ryan!
>>
>> Your point about avoiding first-class metadata requirements is exactly
>> the design principle here. Labels let each catalog surface what it knows
>> without the spec dictating what catalogs must track.
>>
>> To build on this, I put together a POC showing the approach works across
>> the ecosystem.
>>
>> Key design principles that held up in practice:
>>
>> - No new requirements on catalogs. Labels are optional in the response. A
>> catalog that doesn't serve labels returns the same response as today.
>>
>> - Catalog-scoped, not table state. Every catalog we tried already has
>> internal metadata separate from Iceberg properties: Polaris has
>> internalProperties, UC has uc_properties, Lakekeeper has namespace
>> properties in PostgreSQL. Labels just give this existing metadata a
>> standard way through the protocol.
>>
>> - No property overriding. Labels are explicitly separate from table
>> properties. Properties configure behavior, labels describe context. Engines
>> know which is which.
>>
>> What was built:
>>
>> - Spec change: https://github.com/apache/iceberg/pull/15750
>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>
>> Catalog implementations:
>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>> internalProperties)
>> - Unity Catalog OSS:
>> https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>> uc_properties)
>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676 (labels
>> from namespace properties)
>>
>> Full demo: https://github.com/laskoviymishka/irc-labels
>>
>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>> pattern is the same everywhere: each catalog already has internal metadata
>> that doesn't belong in table properties. Labels give it a standard way out
>> through the protocol.
>>
>> The Polaris implementation also addresses
>> https://github.com/apache/polaris/issues/3222 - the community has been
>> asking for a way to surface business metadata alongside table loads. Labels
>> solve this without adding any requirements beyond an optional field.
>>
>> Beyond ownership and classification, the demo also shows labels enabling
>> AI agent table selection (agents reason about tables using semantic labels
>> instead of guessing from column names) and governance via trusted engine
>> (ClickHouse reading sensitivity labels to auto-generate masking policies).
>>
>> Happy to discuss the spec design or any of the implementation details.
>>
>> Andrei
>>
>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>
>>> I think that this is a reasonable way to solve some persistent issues
>>> that we've seen.
>>>
>>> Many catalogs track additional metadata that is not part of the table
>>> spec (or others) like "owner", and right now there is no way to exchange or
>>> share that information. I'm also hesitant to start including it as
>>> first-class metadata because that puts additional requirements on catalogs
>>> that may not align. For instance, Tabular had no concept of a table "owner"
>>> and instead used default grants at the schema level. I like that this
>>> solution allows catalogs to provide information in a generic way that
>>> doesn't add requirements in the REST spec. And it is an alternative to
>>> overriding table properties with catalog-managed information, which I think
>>> is an anti-pattern.
>>>
>>> Thanks, Andrei! I think this is a good idea.
>>>
>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> `LoadTableResponse` returns table metadata (schema, snapshots, file
>>>> locations), but catalogs have operational context about tables that has no
>>>> standard place to go: cost attribution, ownership, governance hints,
>>>> semantic metadata. Right now catalogs have two options:
>>>>
>>>> 1. Properties: durable, commit-versioned table state. Good for
>>>> persistent metadata; wrong for ephemeral catalog context.
>>>> 2. Custom fields: catalog-specific extensions with no
>>>> interoperability. Each catalog invents its own structure; engines have no
>>>> basis to read them.
>>>>
>>>> The community has already identified this gap. Polaris opened an issue
>>>> [1] requesting a standard extension point in the IRC protocol for
>>>> catalog-managed metadata. Two earlier threads [2][3] explored column-level
>>>> metadata, though in the context of table format changes.
>>>>
>>>> We propose adding an optional `labels` field to `LoadTableResponse` for
>>>> catalog-managed metadata. Labels are string key-value pairs generated
>>>> per-request from the catalog's internal systems; nothing is written to
>>>> table files. Engines may use or ignore them entirely. Labels give catalog
>>>> providers a standard channel to surface context to any client without
>>>> bilateral custom integrations for every catalog-engine pair.
>>>>
>>>> Details:
>>>> - GitHub Issue: apache/iceberg#15521
>>>> - Design Document: [4]
>>>>
>>>> Please review the proposal and share your feedback.
>>>>
>>>> Thanks,
>>>> Andrei
>>>>
>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>> [4]:
>>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>>>
>>>
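
For readers skimming the thread, here is a minimal sketch of how a client might consume the optional field proposed in the original message: read `labels` when present, fall back to an empty map when the catalog does not serve it. The helper name `extract_labels`, the sample payloads, and the label keys and values are hypothetical; only the optional, string key-value nature of `labels` comes from the proposal text.

```python
import json
from typing import Any


def extract_labels(load_table_response: dict[str, Any]) -> dict[str, str]:
    """Return the optional `labels` map, or {} when the catalog sends none.

    Non-string entries are dropped, since the proposal describes labels as
    string key-value pairs. This is an illustrative client-side choice,
    not behavior mandated by the spec.
    """
    raw = load_table_response.get("labels")
    if not isinstance(raw, dict):
        return {}
    return {k: v for k, v in raw.items() if isinstance(k, str) and isinstance(v, str)}


# Hypothetical responses: one from a catalog that serves labels, one that does not.
with_labels = json.loads(
    '{"metadata-location": "s3://bucket/t/metadata/v1.json",'
    ' "labels": {"owner": "data-platform", "sensitivity": "pii"}}'
)
without_labels = json.loads('{"metadata-location": "s3://bucket/t/metadata/v1.json"}')

print(extract_labels(with_labels))
print(extract_labels(without_labels))
```

Treating a missing field and an empty map the same way keeps engines that ignore labels, and catalogs that never send them, on today's code path.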
