Hi community,

I would like to start a discussion about supporting column-level updates for primary key tables. [1]

Primary key tables are widely used for upsert workloads, and users often need to update only a subset of columns in wide tables. Typical scenarios include historical data backfill, profile correction, metric recalculation, and GDPR-like masking or removal of sensitive columns.

Today, partial-update can express column-level logical updates, but it appends update records and relies on merge-on-read or compaction to produce the latest visible value. This adds merge cost to the read path until compaction runs. For physical removal scenarios, old sensitive values may also remain in old data files until compaction and snapshot expiration.

This proposal aims to support two capabilities:

1. Physical column-level updates for primary key tables.
2. Predefined column families, so that columns that are frequently updated together can be stored in dedicated physical files and replaced as a column-family unit.

The proposal reuses the existing Data Evolution idea, but applies it to primary key tables with different firstRowId semantics. For primary key tables, firstRowId is only an internal alignment id for grouping physical files. It is not row tracking and should not be exposed as a user-visible row id.

The basic model is:

1. Files are aligned by (partition, bucket, firstRowId, rowCount).
2. firstRowId is used for file-group alignment.
3. rowIndex is used for row alignment inside the file group.
4. The read path first performs a Data Evolution union read to reconstruct full rows.
5. The reconstructed rows are then converted to KeyValue and passed to the existing MergeTree MOR path.

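To make the basic model more concrete, here is a minimal Python sketch of the intended read path. It is only an illustration under stated assumptions, not Paimon code: ColumnFile, union_read, merge_on_read, and the _seq column are hypothetical names; later files in a group are assumed to override earlier ones for the same column; and the merge step is simplified to "highest sequence number wins per key" as a stand-in for the existing MergeTree MOR path.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ColumnFile:
    """Hypothetical stand-in for one physical file (metadata plus column data)."""
    partition: str
    bucket: int
    first_row_id: int  # internal alignment id only, not a user-visible row id
    row_count: int
    columns: Dict[str, List[Any]]  # column name -> values, indexed by rowIndex


def union_read(files: List[ColumnFile]) -> List[Dict[str, Any]]:
    """Group files by alignment key, then stitch columns together by rowIndex."""
    groups = defaultdict(list)
    for f in files:
        key = (f.partition, f.bucket, f.first_row_id, f.row_count)
        groups[key].append(f)

    rows = []
    for (_, _, _, row_count), group in groups.items():
        for row_index in range(row_count):
            row = {}
            # Assumption: a later file in the group overrides earlier ones.
            for f in group:
                for name, values in f.columns.items():
                    row[name] = values[row_index]
            rows.append(row)
    return rows


def merge_on_read(rows, key="id", seq="_seq"):
    """Simplified merge: keep the row with the highest sequence number per key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[seq] >= latest[k][seq]:
            latest[k] = row
    return [latest[k] for k in sorted(latest)]


files = [
    # File group with firstRowId=0: a base file plus a rewritten "email" column file.
    ColumnFile("p0", 0, 0, 2, {"id": [1, 2], "_seq": [1, 1],
                               "name": ["alice", "bob"],
                               "email": ["a@x", "b@x"]}),
    ColumnFile("p0", 0, 0, 2, {"email": ["***", "***"]}),
    # A newer file group with firstRowId=2 that upserts key 1.
    ColumnFile("p0", 0, 2, 1, {"id": [1], "_seq": [2],
                               "name": ["alice-v2"], "email": ["a2@x"]}),
]

rows = union_read(files)      # step 4: reconstruct full rows per file group
merged = merge_on_read(rows)  # step 5: resolve duplicate keys across groups
```

In this sketch the masked email column replaces the old values inside the first file group without touching the other columns, while the upsert for key 1 is still resolved by the key-based merge afterwards.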
With column families, users may define groups such as:

    'column-family.cols.profile' = 'name,age'
    'column-family.cols.privacy' = 'email,phone,address'
    'column-family.cols.metrics' = 'score,level'

For fixed update scenarios, such as sensitive-column masking, Paimon can rewrite only the affected column-family files, remove the old files from active metadata after commit, and let snapshot expiration or orphan cleanup physically delete them later.

Looking forward to your feedback.

[1]: https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables

Best,
Aitozi
