Hi Chandra,

There has been recent discussion (including community calls) on adding
constraint support, including PRIMARY KEY. Could you take a look at the
proposal and see where your ideas fit within it, and where they might
conflict with or extend it?

https://docs.google.com/document/d/1re65fx3uqC7I_tJuS79IxLiB7HEN2Grt5qRIDjd3p-4/edit?tab=t.0#heading=h.o38ny2ndrd79

It would be great to bring your ideas to that venue.

Thanks,
Matt

On Sat, Jun 20, 2026 at 12:34 AM chandra sekhar k <[email protected]> wrote:

> Hi Iceberg Community,
>
> We would like to start a discussion about introducing native primary-key
> table support in Apache Iceberg.
>
> Background
> ==========
>
> Apache Iceberg has become a widely adopted table format for large-scale
> analytic datasets and provides strong support for schema evolution,
> partition evolution, row-level operations, and incremental processing.
>
> At the same time, an increasing number of users are building CDC-driven
> and operational analytics workloads where data is naturally organized
> around primary keys and continuously updated through inserts, updates, and
> deletes.
>
> While Iceberg provides important building blocks such as identifier
> fields, equality deletes, position deletes, and MERGE operations, there is
> currently no standardized primary-key table abstraction within the Iceberg
> specification.
>
> Motivation
> ==========
>
> Many modern data lake workloads rely on:
>
> * Database CDC ingestion
> * Streaming upsert pipelines
> * Data synchronization between transactional systems and data lakes
> * Near real-time operational analytics
> * Incremental changelog consumption
>
> These workloads often require:
>
> * Primary-key based update semantics
> * Efficient handling of high-frequency updates and deletes
> * Storage layouts optimized for mutable data
> * Efficient compaction strategies
> * Standardized changelog generation and consumption
>
> Today, users typically implement these capabilities through
> engine-specific solutions or custom ingestion frameworks, which can lead to
> inconsistent behavior across engines and increased operational complexity.
>
> Existing Iceberg Capabilities and Gaps
> ======================================
>
> Iceberg already provides several important capabilities for mutable
> datasets:
>
> * Identifier fields
> * Equality deletes
> * Position deletes
> * MERGE INTO support through compute engines
> * Incremental snapshot processing
>
> However, these features primarily serve as low-level primitives and do not
> provide a complete primary-key table model.
>
> For example:
>
> * Identifier fields define row identity but do not provide write semantics.
> * MERGE operations are engine-specific and may behave differently across
>   engines.
> * Equality deletes can become expensive for heavy CDC workloads.
> * There is currently no standard mechanism for organizing data around
>   primary keys or exposing changelog semantics.
>
> As a result, users building CDC and streaming upsert workloads often need
> significant custom infrastructure on top of Iceberg.
>
> Industry Context
> ================
>
> Several lakehouse systems have introduced native support for
> primary-key-oriented workloads.
>
> For example, Apache Paimon provides primary-key tables with built-in
> support for upserts, changelog production, and storage layouts optimized
> for mutable data. These capabilities have proven useful for streaming and
> CDC scenarios.
>
> At the same time, many organizations have already standardized on Iceberg
> as their table format and would benefit from similar capabilities without
> requiring adoption of a separate table format.
>
> This raises the question of whether a standardized primary-key table
> abstraction should be part of Iceberg itself.
>
> Initial Proposal
> ================
>
> We would like to discuss introducing a first-class primary-key table
> abstraction in Iceberg.
>
> Conceptually, users could define tables such as:
>
> CREATE TABLE orders (
>     order_id BIGINT PRIMARY KEY,
>     customer_id BIGINT,
>     amount DECIMAL(18,2),
>     updated_at TIMESTAMP
> );
>
> The intent is not to provide OLTP-style uniqueness enforcement or database
> constraints.
>
> Instead, the goal is to provide a standard storage and processing model
> for mutable datasets organized around primary keys.
>
> Potential capabilities could include:
>
> * Primary-key metadata stored as part of table metadata
> * Standardized primary-key write semantics
> * Primary-key aware compaction and maintenance
> * Efficient changelog generation for downstream consumers
> * Optimized storage organization for mutable workloads
> * Consistent behavior across engines
>
> The feature would be optional and would not affect existing Iceberg tables
> or workloads.
>
> Open Questions
> ==============
>
> We would appreciate feedback from the community on the following topics:
>
> 1. Is a native primary-key table abstraction within the scope and vision
>    of Iceberg?
>
> 2. Are existing Iceberg features sufficient to address these use cases?
>
> 3. What are the advantages or disadvantages of introducing primary-key
>    semantics at the table-format level?
>
> 4. Should Iceberg standardize changelog and mutable-data handling for CDC
>    workloads?
>
> 5. What compatibility or interoperability concerns should be considered?
>
> 6. Would the community be interested in reviewing a detailed design
>    proposal if there is agreement on the problem statement?
>
> At Huawei, we have been experimenting with primary-key table semantics in
> production environments for CDC-driven and mutable-data workloads. The
> experience has highlighted both the demand for these capabilities and the
> challenges of building them consistently on top of existing primitives.
> Based on these experiences, we would like to discuss whether a standardized
> approach belongs in Iceberg.
>
> If there is interest from the community, we would be happy to share a
> detailed design proposal covering metadata representation, write/read
> semantics, compaction strategies, changelog support, and engine
> integrations.
>
> Looking forward to hearing the community's thoughts.
>
> Thank you for your consideration,
> Chandra Sekhar
