Thanks for the proposal. I'm supportive of bringing physical column-level updates to primary-key tables, especially for GDPR-style masking where today's partial-update still leaves old values in historical files.
From the current codebase, reusing Data Evolution looks natural: `DataEvolutionSplitRead` / `FieldBunch` already solve multi-file column merging, and `ConflictDetection` already has `dataEvolutionEnabled` logic keyed by `(partition, bucket, firstRowId, rowCount)`.

That said, I think the hard part is exactly what Nicholas raised: **how `firstRowId` alignment survives PK-table compaction and independent column-family rewrites**. Append tables assign `firstRowId` monotonically at write time; PK files are key-sorted and frequently rewritten by LSM compaction, so the design needs an explicit rule, e.g. compact/replace **aligned row groups across families together**, or re-assign a shared `(firstRowId, rowCount)` whenever any family in the group is rewritten.

On the read path, the proposed "union read → KeyValue → MergeTree MOR" pipeline is workable in principle, but today's implementation gap is real: `DataEvolutionSplitRead.withFilter()` is currently a no-op and the class explicitly does not support deletion vectors, while `MergeFileSplitRead` relies heavily on PK-aware pushdown. I'd want the PIP to spell out where filtering, DV application, and sequence resolution happen after column reconstruction, and ideally reuse manifest-level pruning similar to `DataEvolutionFileStoreScan` to avoid full wide-row materialization on selective scans.

For column families, I'd favor a mandatory **base family (PK + sequence/metadata)** plus explicit validation against the existing `sequence-group` config, and clear migration semantics (likely a full rewrite).

Happy to discuss further once there's a concrete compaction/firstRowId example in the PIP.

On Thu, Jun 18, 2026 at 5:10 AM Nicholas Jiang <[email protected]> wrote:
>
> Hi Aitozi,
>
> Thanks for the proposal.
> Supporting physical column-level updates for primary key tables is a
> meaningful direction, especially for GDPR masking, where partial-update's
> merge-on-read semantics leave old sensitive values in data files until
> compaction and snapshot expiration.
>
> I've reviewed the PIP-44 design document against the current Paimon codebase
> and have the following comments and questions.
>
> Overall direction: +1
>
> Reusing the existing Data Evolution infrastructure (firstRowId-based file
> alignment, writeCols tracking, union read) is the right approach. The core
> abstractions (DataEvolutionSplitRead, DataEvolutionFileReader, and the
> FieldBunch pattern) already solve multi-file column merging for append-only
> tables. Extending this to primary key tables is a natural evolution.
>
> 1. Read path: Data Evolution union read + MergeTree MOR pipeline composition
>
> The proposal describes: "Read path first performs Data Evolution union read to
> reconstruct full rows, then converts to KeyValue and passes to the existing
> MergeTree MOR path."
>
> This two-stage pipeline raises some concerns:
>
> - Filter push-down limitation. DataEvolutionSplitRead explicitly does not
> support predicate push-down (see the class-level Javadoc: "this class does not
> support filtering push down and deletion vectors, as they can interfere with
> the process of merging columns"). For primary key tables, MergeFileSplitRead
> currently pushes key-range filters into overlapped sections and all filters
> into non-overlapped sections. After Data Evolution union read reconstructs
> full rows, how are filters applied? If filtering can only happen after
> full-row reconstruction, this could regress scan performance for wide tables
> with selective predicates.
> - Deletion vector coordination. DataEvolutionSplitRead also does not support
> deletion vectors. Primary key tables with deletion vectors enabled
> (BucketedDvMaintainer) rely on DV application during the read. The design
> should clarify: are deletion vectors applied before or after the union read?
> If column-family files are rewritten independently, do deletion vectors need
> to become column-family-scoped, or does a single DV still apply to the aligned
> row group?
> - Sequence number handling. MergeFileSplitRead currently auto-includes
> sequence fields in the read type (see adjustActualReadType), and
> PartialUpdateMergeFunction relies on sequence group comparators to resolve
> conflicts. When column families are stored in separate physical files, which
> file carries the sequence number? If the sequence field is in column-family A
> but the update touches column-family B, how is ordering resolved?
>
> 2. firstRowId semantics for primary key tables
>
> The proposal states: "For primary key tables, firstRowId is only an internal
> alignment id for grouping physical files. It is not row tracking and should
> not be exposed as a user-visible row id."
>
> This is an important distinction. Currently, firstRowId in append-only tables
> is assigned by AppendOnlyWriter using a LongCounter and is monotonically
> increasing. For primary key tables, the alignment story is different because:
>
> - LSM compaction rewrites files. When a compaction produces new files in a
> higher level, how is firstRowId assigned? The rows in the compacted file have
> been merged, so their original alignment with column-family files may no
> longer hold.
> - Key-ordered files. Primary key table files are sorted by key, not by
> insertion order. The proposal should clarify how firstRowId alignment is
> maintained across compaction for different column families.
>
> I think this is the hardest part of the design. Can you elaborate on the
> firstRowId assignment strategy during compaction, specifically whether all
> column families in the same row group must be compacted together, or whether
> they can be compacted independently?
>
> 3. Column family definition and configuration
>
> The proposed configuration style:
> 'column-family.cols.profile' = 'name,age'
> 'column-family.cols.privacy' = 'email,phone,address'
> 'column-family.cols.metrics' = 'score,level'
>
> A few questions:
>
> - Primary key columns. Are primary key columns implicitly in every column
> family (since they're needed for merge)? Or is there a "base" family that
> always contains the PK + sequence fields?
> - Unassigned columns. What happens to columns not assigned to any family? Are
> they in a default family, or is it an error?
> - Interaction with sequence groups. PartialUpdateMergeFunction already
> supports fields.<field>.sequence-group for grouping fields by sequence
> number. What is the relationship between column families and sequence groups?
> Can a sequence group span multiple column families? If not, how is this
> validated?
>
> 4. Write path: column-family-scoped file rewriting
>
> The proposal mentions: "Paimon can rewrite only the affected column-family
> files, remove the old files from active metadata after commit."
>
> This implies a new commit operation that:
> - Reads the existing data for only the affected column families
> - Rewrites those column-family files with updated values
> - Atomically removes old files and adds new files in the manifest
>
> How does this interact with concurrent writes? If a normal upsert and a
> column-family rewrite happen concurrently on the same bucket, the conflict
> detection logic in ConflictDetection (which already has special handling for
> dataEvolutionEnabled) needs to understand column-family boundaries to avoid
> false conflicts, or must it serialize all writes?
>
> 5. Compaction strategy
>
> This is closely related to point 2. Today, compaction in MergeTreeWriter
> operates on all columns together. With column families:
>
> - Must compaction always process all families in a row group together to
> maintain alignment?
> - Can individual families be compacted independently (which would be a major
> efficiency win for the "update only privacy columns" use case)?
> - If families compact independently, how do you handle the case where family
> A has 5 levels but family B has 2 levels? The merge-on-read path needs
> aligned row counts across families.
>
> 6. Compatibility and migration
>
> - How does an existing primary key table migrate to column-family mode? Is it
> an ALTER TABLE operation? Does it require rewriting all existing data files?
> - What happens if a user adds a column via schema evolution that isn't
> assigned to any family?
>
> Minor comments:
>
> - The design document could benefit from a concrete example of the file
> layout before and after a column-family update, showing the manifest entries
> and how firstRowId alignment works.
> - Consider whether the column family metadata should be stored in the table
> schema (and thus versioned with schema evolution) or as table options.
> - For the GDPR use case specifically, it would be helpful to clarify the
> end-to-end guarantee: after a column-family rewrite + snapshot expiration,
> are old values provably unreachable? Or can orphan files still contain them
> until cleanup runs?
>
> Overall, I'm supportive of this direction. The key technical challenge is the
> firstRowId alignment strategy across compaction cycles for primary key
> tables; I think getting that right will unlock everything else cleanly.
>
> Regards,
> Nicholas Jiang
>
> On 2026/06/06 12:19:40 Aitozi wrote:
> > Hi community,
> >
> > I would like to start a discussion about supporting column-level updates
> > for primary key tables. [1]
> >
> > Primary key tables are widely used for upsert workloads, and users often
> > need to update only a subset of columns in wide tables. Typical scenarios
> > include historical data backfill, profile correction, metric recalculation,
> > and GDPR-like masking or removal of sensitive columns.
> >
> > Today, partial-update can express column-level logical updates, but it
> > appends update records and relies on merge-on-read or compaction to produce
> > the latest visible value. This may add extra cost to the read path. For
> > physical removal scenarios, old sensitive values may also remain in old
> > data files until compaction and snapshot expiration.
> >
> > This proposal aims to support two capabilities:
> >
> > 1. Physical column-level updates for primary key tables.
> > 2. Predefined column families, so columns that are frequently updated
> > together can be stored in dedicated physical files and replaced as a
> > column-family unit.
> >
> > The proposal reuses the existing Data Evolution idea, but applies it to
> > primary key tables with a different firstRowId semantic. For primary key
> > tables, firstRowId is only an internal alignment id for grouping physical
> > files. It is not row tracking and should not be exposed as a user-visible
> > row id.
> >
> > The basic model is:
> >
> > 1. Files are aligned by (partition, bucket, firstRowId, rowCount).
> > 2. firstRowId is used for file-group alignment.
> > 3. rowIndex is used for row alignment inside the file group.
> > 4. Read path first performs Data Evolution union read to reconstruct
> > full rows.
> > 5. Then the reconstructed rows are converted to KeyValue and passed to
> > the existing MergeTree MOR path.
> >
> > With column families, users may define groups such as:
> > 'column-family.cols.profile' = 'name,age'
> > 'column-family.cols.privacy' = 'email,phone,address'
> > 'column-family.cols.metrics' = 'score,level'
> >
> > For fixed update scenarios, such as sensitive-column masking, Paimon can
> > rewrite only the affected column-family files, remove the old files from
> > active metadata after commit, and let snapshot expiration or orphan cleanup
> > physically delete them later.
> >
> > Looking forward to your feedback.
> >
> > [1]:
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables
> >
> > Best,
> > Aitozi.
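
P.S. To make the alignment concern from my reply at the top concrete, here is a minimal Java sketch of the invariant I have in mind: every `(partition, bucket, firstRowId, rowCount)` key should be covered by one file from each column family, otherwise a union read cannot reconstruct full rows for that group. `FamilyAlignmentCheck`, `FamilyFile`, and `missingFamilies` are hypothetical names used only for illustration and are not existing Paimon classes; the family names `base` and `privacy` and the partition value are assumptions as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these types exist in Paimon. */
final class FamilyAlignmentCheck {

    /** One physical file holding the columns of a single family (assumed model). */
    static final class FamilyFile {
        final String partition;
        final int bucket;
        final long firstRowId;
        final long rowCount;
        final String family;

        FamilyFile(String partition, int bucket, long firstRowId, long rowCount, String family) {
            this.partition = partition;
            this.bucket = bucket;
            this.firstRowId = firstRowId;
            this.rowCount = rowCount;
            this.family = family;
        }

        /** Alignment key: (partition, bucket, firstRowId, rowCount). */
        String alignmentKey() {
            return partition + "/" + bucket + "/" + firstRowId + "/" + rowCount;
        }
    }

    /**
     * Groups files by alignment key and returns, per key, the expected families
     * that have no file. An empty result means every group is complete.
     */
    static Map<String, Set<String>> missingFamilies(List<FamilyFile> files, Set<String> expected) {
        Map<String, Set<String>> present = new LinkedHashMap<>();
        for (FamilyFile f : files) {
            present.computeIfAbsent(f.alignmentKey(), k -> new HashSet<>()).add(f.family);
        }
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : present.entrySet()) {
            Set<String> gap = new TreeSet<>(expected);
            gap.removeAll(e.getValue());
            if (!gap.isEmpty()) {
                missing.put(e.getKey(), gap);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> families = new TreeSet<>(Arrays.asList("base", "privacy"));

        // Aligned: both families cover the same 100 rows of the same bucket.
        List<FamilyFile> aligned = Arrays.asList(
                new FamilyFile("dt=2026-06-18", 0, 0L, 100L, "base"),
                new FamilyFile("dt=2026-06-18", 0, 0L, 100L, "privacy"));
        System.out.println(missingFamilies(aligned, families));

        // Compaction rewrote only "base" and merged 100 rows down to 80,
        // so the old "privacy" file no longer shares an alignment key with it.
        List<FamilyFile> afterCompaction = Arrays.asList(
                new FamilyFile("dt=2026-06-18", 0, 100L, 80L, "base"),
                new FamilyFile("dt=2026-06-18", 0, 0L, 100L, "privacy"));
        System.out.println(missingFamilies(afterCompaction, families));
    }
}
```

In this sketch the first call returns an empty map, while the second reports both keys as incomplete. That is the state an independent rewrite of one family would leave behind unless the PIP defines a re-alignment rule, which is why I'd like to see that rule written down.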
