Hi Aitozi,

Thanks for the proposal. Supporting physical column-level updates for primary key tables is a meaningful direction, especially for GDPR masking, where partial-update's merge-on-read semantics leave old sensitive values in data files until compaction and snapshot expiration.

I've reviewed the PIP-44 design document against the current Paimon codebase and have the following comments and questions.

Overall direction: +1

Reusing the existing Data Evolution infrastructure (firstRowId-based file alignment, writeCols tracking, union read) is the right approach. The core abstractions (DataEvolutionSplitRead, DataEvolutionFileReader, and the FieldBunch pattern) already solve multi-file column merging for append-only tables. Extending this to primary key tables is a natural evolution.

1. Read path: Data Evolution union read + MergeTree MOR pipeline composition

The proposal describes:

"Read path first performs Data Evolution union read to reconstruct full rows, then converts to KeyValue and passes to the existing MergeTree MOR path."

This two-stage pipeline raises some concerns:

- Filter push-down limitation. DataEvolutionSplitRead explicitly does not support predicate push-down (see the class-level Javadoc: "this class does not support filtering push down and deletion vectors, as they can interfere with the process of merging columns"). For primary key tables, MergeFileSplitRead currently pushes key-range filters into overlapped sections and all filters into non-overlapped sections. After Data Evolution union read reconstructs full rows, how are filters applied? If filtering can only happen after full-row reconstruction, this could regress scan performance for wide tables with selective predicates.
- Deletion vector coordination. DataEvolutionSplitRead also does not support deletion vectors. Primary key tables with deletion vectors enabled (BucketedDvMaintainer) rely on DV application during the read. The design should clarify: are deletion vectors applied before or after the union read? If column-family files are rewritten independently, do deletion vectors need to become column-family-scoped, or does a single DV still apply to the aligned row group?
- Sequence number handling.
  MergeFileSplitRead currently auto-includes sequence fields in the read type (see adjustActualReadType), and PartialUpdateMergeFunction relies on sequence group comparators to resolve conflicts. When column families are stored in separate physical files, which file carries the sequence number? If the sequence field is in column-family A but the update touches column-family B, how is ordering resolved?

2. firstRowId semantics for primary key tables

The proposal states:

"For primary key tables, firstRowId is only an internal alignment id for grouping physical files. It is not row tracking and should not be exposed as a user-visible row id."

This is an important distinction. Currently, firstRowId in append-only tables is assigned by AppendOnlyWriter using a LongCounter and is monotonically increasing. For primary key tables, the alignment story is different because:

- LSM compaction rewrites files. When a compaction produces new files in a higher level, how is firstRowId assigned? The rows in the compacted file have been merged, so their original alignment with column-family files may no longer hold.
- Key-ordered files. Primary key table files are sorted by key, not by insertion order. The proposal should clarify how firstRowId alignment is maintained across compaction for different column families.

I think this is the hardest part of the design. Can you elaborate on the firstRowId assignment strategy during compaction, specifically whether all column families in the same row group must be compacted together, or whether they can be compacted independently?

3. Column family definition and configuration

The proposed configuration style:

    'column-family.cols.profile' = 'name,age'
    'column-family.cols.privacy' = 'email,phone,address'
    'column-family.cols.metrics' = 'score,level'

A few questions:

- Primary key columns. Are primary key columns implicitly in every column family (since they're needed for merge)?
  Or is there a "base" family that always contains the PK + sequence fields?
- Unassigned columns. What happens to columns not assigned to any family? Are they in a default family, or is it an error?
- Interaction with sequence groups.
  PartialUpdateMergeFunction already supports fields.<field>.sequence-group for grouping fields by sequence number. What is the relationship between column families and sequence groups? Can a sequence group span multiple column families? If not, how is this validated?

4. Write path: column-family-scoped file rewriting

The proposal mentions:

"Paimon can rewrite only the affected column-family files, remove the old files from active metadata after commit."

This implies a new commit operation that:

- Reads the existing data for only the affected column families
- Rewrites those column-family files with updated values
- Atomically removes old files and adds new files in the manifest

How does this interact with concurrent writes? If a normal upsert and a column-family rewrite happen concurrently on the same bucket, does the conflict detection logic in ConflictDetection (which already has special handling for dataEvolutionEnabled) need to understand column-family boundaries to avoid false conflicts, or must it serialize all writes?

5. Compaction strategy

This is closely related to point 2. Today, compaction in MergeTreeWriter operates on all columns together. With column families:

- Must compaction always process all families in a row group together to maintain alignment?
- Can individual families be compacted independently (which would be a major efficiency win for the "update only privacy columns" use case)?
- If families compact independently, how do you handle the case where family A has 5 levels but family B has 2 levels? The merge-on-read path needs aligned row counts across families.

6. Compatibility and migration

- How does an existing primary key table migrate to column-family mode? Is it an ALTER TABLE operation? Does it require rewriting all existing data files?
- What happens if a user adds a column via schema evolution that isn't assigned to any family?
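
To make the questions in points 3 and 6 concrete, here is a rough sketch of the kind of validation I have in mind. It is written in Python for brevity and is not Paimon code. Only the 'column-family.cols.<name>' option prefix comes from the proposal; the function names and the rules themselves (primary key columns may not be listed in a family, a column may belong to only one family, unassigned columns are reported) are my assumptions and exactly the points the design should settle.

```python
# Hypothetical sketch, not Paimon code. Only the option prefix comes from the proposal.
PREFIX = "column-family.cols."


def parse_column_families(options):
    """Map family name -> list of columns from 'column-family.cols.<name>' options."""
    families = {}
    for key, value in options.items():
        if key.startswith(PREFIX):
            name = key[len(PREFIX):]
            families[name] = [c.strip() for c in value.split(",") if c.strip()]
    return families


def check_column_families(options, schema_columns, primary_keys):
    """Return a list of problems; an empty list means the layout is consistent.

    Assumed rules (open questions in the design, not confirmed behavior):
      - primary key columns are implicit and must not be listed in a family
      - a column belongs to at most one family
      - every non-key column must be assigned to some family
    """
    problems = []
    owner = {}  # column -> family that claimed it first
    for name, cols in sorted(parse_column_families(options).items()):
        for col in cols:
            if col not in schema_columns:
                problems.append(f"family '{name}' references unknown column '{col}'")
            elif col in primary_keys:
                problems.append(f"family '{name}' lists primary key column '{col}'")
            elif col in owner:
                problems.append(f"column '{col}' is in both '{owner[col]}' and '{name}'")
            else:
                owner[col] = name
    unassigned = [c for c in schema_columns
                  if c not in primary_keys and c not in owner]
    if unassigned:
        problems.append("unassigned columns: " + ", ".join(unassigned))
    return problems


options = {
    "column-family.cols.profile": "name,age",
    "column-family.cols.privacy": "email,phone,address",
    "column-family.cols.metrics": "score,level",
}
# 'region' stands in for a column added later via schema evolution.
schema = ["id", "name", "age", "email", "phone", "address", "score", "level", "region"]
print(check_column_families(options, schema, primary_keys=["id"]))
```

With the example options from the proposal and a hypothetical 'region' column added later, the check reports 'region' as unassigned. Whether that case should be an error or should fall into a default family is the decision I'd like the design to state explicitly.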

Minor comments:

- The design document could benefit from a concrete example of the file layout before and after a column-family update, showing the manifest entries and how firstRowId alignment works.
- Consider whether the column family metadata should be stored in the table schema (and thus versioned with schema evolution) or as table options.
- For the GDPR use case specifically, it would be helpful to clarify the end-to-end guarantee: after a column-family rewrite + snapshot expiration, are old values provably unreachable? Or can orphan files still contain them until cleanup runs?

Overall, I'm supportive of this direction. The key technical challenge is the firstRowId alignment strategy across compaction cycles for primary key tables; I think getting that right will unlock everything else cleanly.

Regards,
Nicholas Jiang

On 2026/06/06 12:19:40 Aitozi wrote:
> Hi community,
>
> I would like to start a discussion about supporting column-level updates
> for primary key tables. [1]
>
> Primary key tables are widely used for upsert workloads, and users often
> need to update only a subset of columns in wide tables. Typical scenarios
> include historical data backfill, profile correction, metric recalculation,
> and GDPR-like masking or removal of sensitive columns.
>
> Today, partial-update can express column-level logical updates, but it
> appends update records and relies on merge-on-read or compaction to produce
> the latest visible value. This may add extra cost to the read path. For
> physical removal scenarios, old sensitive values may also remain in old
> data files until compaction and snapshot expiration.
>
> This proposal aims to support two capabilities:
>
> 1. Physical column-level updates for primary key tables.
> 2. Predefined column families, so columns that are frequently updated
> together can be stored in dedicated physical files and replaced as a
> column-family unit.
>
> The proposal reuses the existing Data Evolution idea, but applies it to
> primary key tables with a different firstRowId semantic. For primary key
> tables, firstRowId is only an internal alignment id for grouping physical
> files. It is not row tracking and should not be exposed as a user-visible
> row id.
>
> The basic model is:
>
> 1. Files are aligned by (partition, bucket, firstRowId, rowCount).
> 2. firstRowId is used for file-group alignment.
> 3. rowIndex is used for row alignment inside the file group.
> 4. Read path first performs Data Evolution union read to reconstruct
> full rows.
> 5. Then the reconstructed rows are converted to KeyValue and passed to
> the existing MergeTree MOR path.
>
> With column families, users may define groups such as:
> 'column-family.cols.profile' = 'name,age' 'column-family.cols.privacy' =
> 'email,phone,address' 'column-family.cols.metrics' = 'score,level'
>
> For fixed update scenarios, such as sensitive-column masking, Paimon can
> rewrite only the affected column-family files, remove the old files from
> active metadata after commit, and let snapshot expiration or orphan cleanup
> physically delete them later.
>
>
> Looking forward to your feedback.
>
> [1]:
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables
>
> Best,
> Aitozi.
>
