Re: [DISCUSS] Support Data Evolution for Primary Key Tables

wj wang Fri, 19 Jun 2026 01:13:04 -0700

Thanks for the proposal.

I'm supportive of bringing physical column-level updates to
primary-key tables, especially for GDPR-style masking where today's
partial-update still leaves old values in historical files.


From the current codebase, reusing Data Evolution looks natural:
`DataEvolutionSplitRead` / `FieldBunch` already solve multi-file
column merging, and `ConflictDetection` already has
`dataEvolutionEnabled` logic keyed by `(partition, bucket, firstRowId,
rowCount)`. That said, I think the hard part is exactly what Nicholas
raised: **how `firstRowId` alignment survives PK-table compaction and
independent column-family rewrites**. Append tables assign
`firstRowId` monotonically at write time; PK files are key-sorted and
frequently rewritten by LSM compaction, so the design needs an
explicit rule — e.g. compact/replace **aligned row groups across
families together**, or re-assign a shared `(firstRowId, rowCount)`
whenever any family in the group is rewritten.
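
To make the alignment concern concrete, here is a minimal Python sketch of the grouping rule. It is not Paimon code: `FamilyFile`, `group_by_alignment`, and `misaligned_groups` are hypothetical names, and the sketch assumes exactly one file per column family in each aligned group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyFile:
    # Hypothetical stand-in for a data file entry; not a Paimon class.
    partition: str
    bucket: int
    first_row_id: int
    row_count: int
    family: str


def alignment_key(f):
    # The alignment tuple named in the proposal.
    return (f.partition, f.bucket, f.first_row_id, f.row_count)


def group_by_alignment(files):
    # Map each alignment key to {family name: file}.
    groups = defaultdict(dict)
    for f in files:
        fams = groups[alignment_key(f)]
        if f.family in fams:
            raise ValueError(f"duplicate family {f.family} in {alignment_key(f)}")
        fams[f.family] = f
    return dict(groups)


def misaligned_groups(files, families):
    # Return alignment keys whose group lacks a file for some family.
    required = set(families)
    return [k for k, fams in group_by_alignment(files).items()
            if set(fams) != required]


aligned = [
    FamilyFile("p0", 0, 0, 100, "base"),
    FamilyFile("p0", 0, 0, 100, "privacy"),
]
# Assumed scenario: only the privacy family is rewritten, and the rewrite
# merges away two rows, so its rowCount no longer matches the base file.
after_rewrite = [aligned[0], FamilyFile("p0", 0, 0, 98, "privacy")]

print(misaligned_groups(aligned, ["base", "privacy"]))
print(misaligned_groups(after_rewrite, ["base", "privacy"]))
```

In this sketch the independent rewrite leaves two half-empty groups, which is the situation the "rewrite together or re-assign a shared range" rule would have to prevent.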

On the read path, the proposed "union read → KeyValue → MergeTree MOR"
pipeline is workable in principle, but today's implementation gap is
real: `DataEvolutionSplitRead.withFilter()` is currently a no-op and
the class explicitly does not support deletion vectors, while
`MergeFileSplitRead` relies heavily on PK-aware pushdown. I'd want the
PIP to spell out where filtering/DV/sequence resolution happen after
column reconstruction, and ideally reuse manifest-level pruning
similar to `DataEvolutionFileStoreScan` to avoid full wide-row
materialization on selective scans.
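
As an illustration of why the filter placement matters, here is a small Python sketch of the two-stage pipeline. It is not Paimon code: `union_read` and `merge_latest` are hypothetical helpers, rows are plain dicts, and the merge is assumed to be last-writer-wins by a `seq` field rather than a full partial-update merge.

```python
def union_read(family_columns):
    # Zip the column-family files of one aligned group into full rows,
    # matching rows by their index inside the group.
    lengths = {len(rows) for rows in family_columns.values()}
    if len(lengths) != 1:
        raise ValueError("families in one group must have equal row counts")
    merged = []
    for i in range(lengths.pop()):
        row = {}
        for rows in family_columns.values():
            row.update(rows[i])
        merged.append(row)
    return merged


def merge_latest(rows, key="id", seq="seq"):
    # Keep, per key, the row with the highest sequence number.
    latest = {}
    for r in rows:
        cur = latest.get(r[key])
        if cur is None or r[seq] >= cur[seq]:
            latest[r[key]] = r
    return list(latest.values())


# Two overlapping file groups holding the same key at different sequences.
old_group = union_read({"base": [{"id": 1, "seq": 1}], "profile": [{"age": 30}]})
new_group = union_read({"base": [{"id": 1, "seq": 2}], "profile": [{"age": 17}]})
rows = old_group + new_group


def is_adult(r):
    return r["age"] >= 18


# Value filter applied after the merge: the latest row (age 17) is dropped.
filter_after_merge = [r for r in merge_latest(rows) if is_adult(r)]
# Value filter applied before the merge: the stale row (age 30) survives.
filter_before_merge = merge_latest([r for r in rows if is_adult(r)])
```

Under these assumptions, a value predicate is only safe after sequence resolution for overlapping groups, so the PIP would need to say which predicates can still be pushed below column reconstruction.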

For column families, I'd favor a mandatory **base family (PK +
sequence/metadata)** plus explicit validation against existing
`sequence-group` config, and clear migration semantics (likely a full
rewrite). Happy to discuss further once there's a concrete
compaction/firstRowId example in the PIP.
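
For the validation part, a rough Python sketch of what I have in mind. `validate_families` is a hypothetical helper, not an existing Paimon API, and the rules are assumptions: PK columns live only in the base family, unassigned columns fall back to the base family, and the fields governed by one sequence group must not span more than one family.

```python
def validate_families(columns, primary_keys, families, sequence_groups):
    # families: family name -> list of column names (base family excluded).
    # sequence_groups: sequence field -> list of fields it governs.
    errors = []
    owner = {}
    for fam, cols in families.items():
        for c in cols:
            if c in primary_keys:
                errors.append(f"{c}: primary key column belongs to the base family, not {fam}")
            elif c not in columns:
                errors.append(f"{c}: unknown column in family {fam}")
            elif c in owner:
                errors.append(f"{c}: assigned to both {owner[c]} and {fam}")
            else:
                owner[c] = fam
    for seq_field, fields in sequence_groups.items():
        # Columns without an explicit family are treated as base-family columns.
        spanned = {owner.get(f, "base") for f in fields}
        if len(spanned) > 1:
            errors.append(f"sequence group {seq_field} spans families {sorted(spanned)}")
    return errors


columns = ["id", "seq_a", "name", "age", "email", "phone"]
families = {"profile": ["name", "age"], "privacy": ["email", "phone"]}

ok = validate_families(columns, ["id"], families, {"seq_a": ["name", "age"]})
bad = validate_families(columns, ["id"], families, {"seq_a": ["name", "email"]})
```

Whether a sequence group may span families is exactly the open question from Nicholas's point 3, so the last rule could be relaxed if the PIP defines how ordering is resolved across families.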

On Thu, Jun 18, 2026 at 5:10 AM Nicholas Jiang <[email protected]> wrote:
>
> Hi Aitozi,
>
> Thanks for the proposal. Supporting physical column-level updates for primary
> key tables is a meaningful direction — especially for GDPR masking, where
> partial-update's merge-on-read semantics leave old sensitive values in data
> files until compaction and snapshot expiration.
>
> I've reviewed the PIP-44 design document against the current Paimon codebase
> and have the following comments and questions.
>
> Overall direction: +1
>
> Reusing the existing Data Evolution infrastructure (firstRowId-based file
> alignment, writeCols tracking, union read) is the right approach. The core
> abstractions — DataEvolutionSplitRead, DataEvolutionFileReader, and the
> FieldBunch pattern — already solve multi-file column merging for append-only
> tables. Extending this to primary key tables is a natural evolution.
>
> 1. Read path: Data Evolution union read + MergeTree MOR pipeline composition
>
> The proposal describes: "Read path first performs Data Evolution union read to
> reconstruct full rows, then converts to KeyValue and passes to the existing
> MergeTree MOR path."
>
> This two-stage pipeline raises some concerns:
>
> - Filter push-down limitation. DataEvolutionSplitRead explicitly does not
> support predicate push-down (see the class-level Javadoc: "this class does not
> support filtering push down and deletion vectors, as they can interfere with
> the process of merging columns"). For primary key tables, MergeFileSplitRead
> currently pushes key-range filters into overlapped sections and all filters
> into non-overlapped sections. After Data Evolution union read reconstructs
> full rows, how are filters applied? If filtering can only happen after
> full-row reconstruction, this could regress scan performance for wide tables
> with selective predicates.
> - Deletion vector coordination. DataEvolutionSplitRead also does not support
> deletion vectors. Primary key tables with deletion vectors enabled
> (BucketedDvMaintainer) rely on DV application during the read. The design
> should clarify: are deletion vectors applied before or after the union read?
> If column-family files are rewritten independently, do deletion vectors need
> to become column-family-scoped, or does a single DV still apply to the aligned
> row group?
> - Sequence number handling. MergeFileSplitRead currently auto-includes
> sequence fields in the read type (see adjustActualReadType), and
> PartialUpdateMergeFunction relies on sequence group comparators to resolve
> conflicts. When column families are stored in separate physical files, which
> file carries the sequence number? If the sequence field is in column-family A
> but the update touches column-family B, how is ordering resolved?
>
> 2. firstRowId semantics for primary key tables
>
> The proposal states: "For primary key tables, firstRowId is only an internal
> alignment id for grouping physical files. It is not row tracking and should
> not be exposed as a user-visible row id."
>
> This is an important distinction. Currently, firstRowId in append-only tables
> is assigned by AppendOnlyWriter using a LongCounter and is monotonically
> increasing. For primary key tables, the alignment story is different because:
>
> - LSM compaction rewrites files. When a compaction produces new files in a
> higher level, how is firstRowId assigned? The rows in the compacted file have
> been merged — their original alignment with column-family files may no longer
> hold.
> - Key-ordered files. Primary key table files are sorted by key, not by
> insertion order. The proposal should clarify how firstRowId alignment is
> maintained across compaction for different column families.
>
> I think this is the hardest part of the design. Can you elaborate on the
> firstRowId assignment strategy during compaction — specifically, whether all
> column families in the same row group must be compacted together, or whether
> they can be compacted independently?
>
> 3. Column family definition and configuration
>
> The proposed configuration style:
> 'column-family.cols.profile' = 'name,age'
> 'column-family.cols.privacy' = 'email,phone,address'
> 'column-family.cols.metrics' = 'score,level'
>
> A few questions:
>
> - Primary key columns. Are primary key columns implicitly in every column
> family (since they're needed for merge)? Or is there a "base" family that
> always contains the PK + sequence fields?
> - Unassigned columns. What happens to columns not assigned to any family? Are
> they in a default family, or is it an error?
> - Interaction with sequence groups. PartialUpdateMergeFunction already
> supports fields.<field>.sequence-group for grouping fields by sequence
> number. What is the relationship between column families and sequence groups?
> Can a sequence group span multiple column families? If not, how is this
> validated?
>
> 4. Write path: column-family-scoped file rewriting
>
> The proposal mentions: "Paimon can rewrite only the affected column-family 
> files, remove the old files from active metadata after commit."
>
> This implies a new commit operation that:
> - Reads the existing data for only the affected column families
> - Rewrites those column-family files with updated values
> - Atomically removes old files and adds new files in the manifest
>
> How does this interact with concurrent writes? If a normal upsert and a 
> column-family rewrite happen concurrently on the same bucket, the conflict 
> detection logic in ConflictDetection (which already has special handling for
> dataEvolutionEnabled) needs to understand column-family boundaries to avoid 
> false conflicts — or must it serialize all writes?
>
> 5. Compaction strategy
>
> This is closely related to point 2. Today, compaction in MergeTreeWriter 
> operates on all columns together. With column families:
>
> - Must compaction always process all families in a row group together to 
> maintain alignment?
> - Can individual families be compacted independently (which would be a major 
> efficiency win for the "update only privacy columns" use case)?
> - If families compact independently, how do you handle the case where family 
> A has 5 levels but family B has 2 levels — the merge-on-read path needs 
> aligned row counts across families.
>
> 6. Compatibility and migration
>
> - How does an existing primary key table migrate to column-family mode? Is it 
> an ALTER TABLE operation? Does it require rewriting all existing data files?
> - What happens if a user adds a column via schema evolution that isn't 
> assigned to any family?
>
> Minor comments:
>
> - The design document could benefit from a concrete example of the file 
> layout before and after a column-family update, showing the manifest entries 
> and how firstRowId alignment works.
> - Consider whether the column family metadata should be stored in the table 
> schema (and thus versioned with schema evolution) or as table options.
> - For the GDPR use case specifically, it would be helpful to clarify the
> end-to-end guarantee: after a column-family rewrite + snapshot expiration,
> are old values provably unreachable? Or can orphan files still contain them
> until cleanup runs?
>
> Overall, I'm supportive of this direction. The key technical challenge is the 
> firstRowId alignment strategy across compaction cycles for primary key tables 
> — I think getting that right will unlock everything else cleanly.
>
> Regards,
> Nicholas Jiang
>
> On 2026/06/06 12:19:40 Aitozi wrote:
> > Hi community,
> >
> > I would like to start a discussion about supporting column-level updates
> > for primary key tables. [1]
> >
> > Primary key tables are widely used for upsert workloads, and users often
> > need to update only a subset of columns in wide tables. Typical scenarios
> > include historical data backfill, profile correction, metric recalculation,
> > and GDPR-like masking or removal of sensitive columns.
> >
> > Today, partial-update can express column-level logical updates, but it
> > appends update records and relies on merge-on-read or compaction to produce
> > the latest visible value. This may add extra cost to the read path. For
> > physical removal scenarios, old sensitive values may also remain in old
> > data files until compaction and snapshot expiration.
> >
> > This proposal aims to support two capabilities:
> >
> >    1. Physical column-level updates for primary key tables.
> >    2. Predefined column families, so columns that are frequently updated
> >    together can be stored in dedicated physical files and replaced as a
> >    column-family unit.
> >
> > The proposal reuses the existing Data Evolution idea, but applies it to
> > primary key tables with a different firstRowId semantic. For primary key
> > tables, firstRowId is only an internal alignment id for grouping physical
> > files. It is not row tracking and should not be exposed as a user-visible
> > row id.
> >
> > The basic model is:
> >
> >    1. Files are aligned by (partition, bucket, firstRowId, rowCount).
> >    2. firstRowId is used for file-group alignment.
> >    3. rowIndex is used for row alignment inside the file group.
> >    4. Read path first performs Data Evolution union read to reconstruct
> >    full rows.
> >    5. Then the reconstructed rows are converted to KeyValue and passed to
> >    the existing MergeTree MOR path.
> >
> > With column families, users may define groups such as:
> > 'column-family.cols.profile' = 'name,age'
> > 'column-family.cols.privacy' = 'email,phone,address'
> > 'column-family.cols.metrics' = 'score,level'
> >
> > For fixed update scenarios, such as sensitive-column masking, Paimon can
> > rewrite only the affected column-family files, remove the old files from
> > active metadata after commit, and let snapshot expiration or orphan cleanup
> > physically delete them later.
> >
> >
> > Looking forward to your feedback.
> >
> > [1]:
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables
> >
> > Best,
> > Aitozi.
> >
