Re: [DISCUSS] Support Data Evolution for Primary Key Tables

Nicholas Jiang Wed, 17 Jun 2026 14:10:24 -0700

Hi Aitozi,

Thanks for the proposal. Supporting physical column-level updates for primary
key tables is a meaningful direction — especially for GDPR masking, where
partial-update's merge-on-read semantics leave old sensitive values in data
files until compaction and snapshot expiration.


I've reviewed the PIP-44 design document against the current Paimon codebase
and have the following comments and questions.

Overall direction: +1

Reusing the existing Data Evolution infrastructure (firstRowId-based file
alignment, writeCols tracking, union read) is the right approach. The core
abstractions — DataEvolutionSplitRead, DataEvolutionFileReader, and the
FieldBunch pattern — already solve multi-file column merging for append-only
tables. Extending this to primary key tables is a natural evolution.

1. Read path: Data Evolution union read + MergeTree MOR pipeline composition

The proposal describes: "Read path first performs Data Evolution union read to
reconstruct full rows, then converts to KeyValue and passes to the existing
MergeTree MOR path."

This two-stage pipeline raises some concerns:

- Filter push-down limitation. DataEvolutionSplitRead explicitly does not
support predicate push-down (see the class-level Javadoc: "this class does not
support filtering push down and deletion vectors, as they can interfere with
the process of merging columns"). For primary key tables, MergeFileSplitRead
currently pushes key-range filters into overlapped sections and all filters
into non-overlapped sections. After Data Evolution union read reconstructs
full rows, how are filters applied? If filtering can only happen after
full-row reconstruction, this could regress scan performance for wide tables
with selective predicates.
- Deletion vector coordination. DataEvolutionSplitRead also does not support
deletion vectors. Primary key tables with deletion vectors enabled
(BucketedDvMaintainer) rely on DV application during the read. The design
should clarify: are deletion vectors applied before or after the union read?
If column-family files are rewritten independently, do deletion vectors need
to become column-family-scoped, or does a single DV still apply to the aligned
row group?
- Sequence number handling. MergeFileSplitRead currently auto-includes
sequence fields in the read type (see adjustActualReadType), and
PartialUpdateMergeFunction relies on sequence group comparators to resolve
conflicts. When column families are stored in separate physical files, which
file carries the sequence number? If the sequence field is in column-family A
but the update touches column-family B, how is ordering resolved?
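
As a toy illustration of the filter push-down concern (not Paimon code; the
file contents and the helper names union_read and adult are made up for this
sketch), the following models two column-family files that are aligned purely
by row position. Filtering one file before the positional union shifts its
rows and pairs values that belong to different keys, which is why the filter
seemingly has to wait until full rows are reconstructed unless row positions
are carried along.

```python
# Toy model: two column-family files in one file group, aligned by row position.
profile = [{"id": 1, "age": 17}, {"id": 2, "age": 42}, {"id": 3, "age": 35}]
privacy = [{"email": "a@x"}, {"email": "b@x"}, {"email": "c@x"}]


def union_read(*files):
    """Positional union read: row i of the result merges row i of every file."""
    return [{k: v for row in rows for k, v in row.items()} for rows in zip(*files)]


def adult(row):
    return row["age"] >= 18


# Safe: reconstruct full rows first, then filter.
filter_after = [r for r in union_read(profile, privacy) if adult(r)]

# Unsafe: push the filter into the file holding `age`, then union by position.
pushed_down = union_read([r for r in profile if adult(r)], privacy)

print(filter_after)  # ids 2 and 3 paired with b@x and c@x
print(pushed_down)   # ids 2 and 3 paired with a@x and b@x (wrong rows)
```

In this model the only safe plan is filter-after-union, which reads every
column family in full even for a highly selective predicate.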

2. firstRowId semantics for primary key tables

The proposal states: "For primary key tables, firstRowId is only an internal
alignment id for grouping physical files. It is not row tracking and should
not be exposed as a user-visible row id."

This is an important distinction. Currently, firstRowId in append-only tables
is assigned by AppendOnlyWriter using a LongCounter and is monotonically
increasing. For primary key tables, the alignment story is different because:

- LSM compaction rewrites files. When a compaction produces new files in a
higher level, how is firstRowId assigned? The rows in the compacted file have
been merged — their original alignment with column-family files may no longer
hold.
- Key-ordered files. Primary key table files are sorted by key, not by
insertion order. The proposal should clarify how firstRowId alignment is
maintained across compaction for different column families.

I think this is the hardest part of the design. Can you elaborate on the
firstRowId assignment strategy during compaction — specifically, whether all
column families in the same row group must be compacted together, or whether
they can be compacted independently?
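
To make the alignment question concrete, here is a minimal sketch under stated
assumptions: a single partition and bucket, files described only by
(firstRowId, rowCount), and a hypothetical independent compaction of one
family that merges two files and collapses duplicate keys. The DataFile tuple
and incomplete_groups helper are illustrative names, not Paimon classes.

```python
from collections import defaultdict
from typing import NamedTuple


class DataFile(NamedTuple):
    name: str
    family: str
    first_row_id: int
    row_count: int


def incomplete_groups(files, families):
    """Alignment keys (first_row_id, row_count) that lack a file for some family."""
    present = defaultdict(set)
    for f in files:
        present[(f.first_row_id, f.row_count)].add(f.family)
    return sorted(key for key, fams in present.items() if fams != families)


FAMILIES = {"profile", "privacy"}

before = [
    DataFile("profile-0", "profile", 0, 100),
    DataFile("privacy-0", "privacy", 0, 100),
    DataFile("profile-1", "profile", 100, 100),
    DataFile("privacy-1", "privacy", 100, 100),
]

# Hypothetical independent compaction of `profile` only: the two files merge,
# 20 duplicate keys collapse, and the output gets a fresh first_row_id.
after = [f for f in before if f.family != "profile"] + [
    DataFile("profile-2", "profile", 200, 180)
]

print(incomplete_groups(before, FAMILIES))  # []
print(incomplete_groups(after, FAMILIES))   # [(0, 100), (100, 100), (200, 180)]
```

Under these assumptions every file group becomes incomplete after the
single-family compaction, so either families are compacted together or the
alignment key needs to survive a merge in some way the proposal would have to
spell out.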

3. Column family definition and configuration

The proposed configuration style:
'column-family.cols.profile' = 'name,age'
'column-family.cols.privacy' = 'email,phone,address'
'column-family.cols.metrics' = 'score,level'

A few questions:

- Primary key columns. Are primary key columns implicitly in every column
family (since they're needed for merge)? Or is there a "base" family that
always contains the PK + sequence fields?
- Unassigned columns. What happens to columns not assigned to any family? Are
they in a default family, or is it an error?
- Interaction with sequence groups. PartialUpdateMergeFunction already supports
fields.<field>.sequence-group for grouping fields by sequence number. What is
the relationship between column families and sequence groups? Can a sequence
group span multiple column families? If not, how is this validated?
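
One possible shape for such validation, as a sketch only: it assumes that
unassigned columns are an error and that a sequence group may not span
families, neither of which the proposal states. The parse_families and
validate helpers, the updated_at column, and the already-parsed
sequence_groups mapping are all hypothetical.

```python
def parse_families(options, prefix="column-family.cols."):
    """Map family name -> columns from 'column-family.cols.<name>' options."""
    return {
        key[len(prefix):]: [c.strip() for c in value.split(",")]
        for key, value in options.items()
        if key.startswith(prefix)
    }


def validate(families, value_columns, sequence_groups):
    """Return a list of problems; an empty list means the layout passed."""
    problems = []
    owner = {}
    for family, cols in families.items():
        for col in cols:
            if col in owner:
                problems.append(f"{col} is in both {owner[col]} and {family}")
            owner[col] = family
    for col in value_columns:
        if col not in owner:
            problems.append(f"{col} is not assigned to any family")
    for seq_field, grouped in sequence_groups.items():
        spanned = sorted({owner[c] for c in [seq_field, *grouped] if c in owner})
        if len(spanned) > 1:
            problems.append(f"sequence group {seq_field} spans {spanned}")
    return problems


options = {
    "column-family.cols.profile": "name,age",
    "column-family.cols.privacy": "email,phone,address",
    "column-family.cols.metrics": "score,level",
    "bucket": "4",
}
families = parse_families(options)
value_columns = ["name", "age", "email", "phone", "address", "score", "level", "updated_at"]
# Parsed form of a hypothetical fields.updated_at.sequence-group = 'age,email'.
sequence_groups = {"updated_at": ["age", "email"]}

problems = validate(families, value_columns, sequence_groups)
print(problems)
```

Here the sequence field itself is unassigned and its group touches both
profile and privacy, so both rules fire. Whether either should be a hard error
is exactly what the questions above ask.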

4. Write path: column-family-scoped file rewriting

The proposal mentions: "Paimon can rewrite only the affected column-family 
files, remove the old files from active metadata after commit."

This implies a new commit operation that:
- Reads the existing data for only the affected column families
- Rewrites those column-family files with updated values
- Atomically removes old files and adds new files in the manifest

How does this interact with concurrent writes? If a normal upsert and a
column-family rewrite happen concurrently on the same bucket, does the conflict
detection logic in ConflictDetection (which already has special handling for
dataEvolutionEnabled) need to understand column-family boundaries to avoid
false conflicts, or must all writes to that bucket be serialized?
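
The two options can be contrasted with a deliberately coarse model. It does
not reflect how ConflictDetection works today; the commit dictionaries, the
conflicts function, and the assumption that a normal upsert touches every
family are all hypothetical.

```python
def conflicts(a, b, family_aware):
    """Two pending commits conflict if they touch the same bucket and, when
    family_aware is set, at least one common column family."""
    if a["bucket"] != b["bucket"]:
        return False
    if not family_aware:
        return True
    return bool(a["families"] & b["families"])


ALL = frozenset({"profile", "privacy", "metrics"})
upsert = {"bucket": 0, "families": ALL}  # assumed to write full rows
mask_privacy = {"bucket": 0, "families": frozenset({"privacy"})}
recompute_metrics = {"bucket": 0, "families": frozenset({"metrics"})}

# Bucket-level detection serializes two unrelated family rewrites.
print(conflicts(mask_privacy, recompute_metrics, family_aware=False))  # True
# Family-aware detection lets them commit concurrently.
print(conflicts(mask_privacy, recompute_metrics, family_aware=True))   # False
# A full-row upsert still overlaps with any family rewrite in this model.
print(conflicts(upsert, mask_privacy, family_aware=True))              # True
```

Under this toy rule, family awareness only helps rewrite-versus-rewrite
concurrency. Whether upserts could also be scoped to families is a separate
design question.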

5. Compaction strategy

This is closely related to point 2. Today, compaction in MergeTreeWriter 
operates on all columns together. With column families:

- Must compaction always process all families in a row group together to 
maintain alignment?
- Can individual families be compacted independently (which would be a major 
efficiency win for the "update only privacy columns" use case)?
- If families compact independently, how is the case handled where family A
has 5 levels but family B has only 2? The merge-on-read path needs aligned row
counts across families.

6. Compatibility and migration

- How does an existing primary key table migrate to column-family mode? Is it 
an ALTER TABLE operation? Does it require rewriting all existing data files?
- What happens if a user adds a column via schema evolution that isn't assigned 
to any family?

Minor comments:

- The design document could benefit from a concrete example of the file layout 
before and after a column-family update, showing the manifest entries and how 
firstRowId alignment works.
- Consider whether the column family metadata should be stored in the table 
schema (and thus versioned with schema evolution) or as table options.
- For the GDPR use case specifically, it would be helpful to clarify the
end-to-end guarantee: after a column-family rewrite + snapshot expiration, are
old values provably unreachable? Or can orphan files still contain them until
cleanup runs?
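
The distinction behind the GDPR question can be sketched as follows. This is a
toy model, not Paimon's expiration logic: snapshots are plain sets of file
names, the file names are invented, and tags and branches are ignored.

```python
def reachable_files(snapshots, retained_ids):
    """Files referenced by any snapshot that has not been expired yet."""
    return set().union(*(snapshots[i] for i in retained_ids))


snapshots = {
    1: {"profile-0", "privacy-0"},  # before the privacy rewrite
    2: {"profile-0", "privacy-1"},  # after the privacy rewrite
}
on_disk = {"profile-0", "privacy-0", "privacy-1"}

# Right after the rewrite commits, snapshot 1 still exposes the old values.
after_commit = reachable_files(snapshots, retained_ids={1, 2})

# After snapshot 1 expires, the old file is unreachable from metadata...
after_expire = reachable_files(snapshots, retained_ids={2})

# ...but it remains on disk until it is physically deleted.
lingering = on_disk - after_expire

print("privacy-0" in after_commit)  # True
print("privacy-0" in after_expire)  # False
print(lingering)                    # {'privacy-0'}
```

In this model "unreachable" and "deleted" are two different states, and the
guarantee a user can rely on depends on which one the design promises and by
when.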

Overall, I'm supportive of this direction. The key technical challenge is the 
firstRowId alignment strategy across compaction cycles for primary key tables — 
I think getting that right will unlock everything else cleanly.

Regards,
Nicholas Jiang

On 2026/06/06 12:19:40 Aitozi wrote:
> Hi community,
> 
> I would like to start a discussion about supporting column-level updates
> for primary key tables. [1]
> 
> Primary key tables are widely used for upsert workloads, and users often
> need to update only a subset of columns in wide tables. Typical scenarios
> include historical data backfill, profile correction, metric recalculation,
> and GDPR-like masking or removal of sensitive columns.
> 
> Today, partial-update can express column-level logical updates, but it
> appends update records and relies on merge-on-read or compaction to produce
> the latest visible value. This may add extra cost to the read path. For
> physical removal scenarios, old sensitive values may also remain in old
> data files until compaction and snapshot expiration.
> 
> This proposal aims to support two capabilities:
> 
>    1. Physical column-level updates for primary key tables.
>    2. Predefined column families, so columns that are frequently updated
>    together can be stored in dedicated physical files and replaced as a
>    column-family unit.
> 
> The proposal reuses the existing Data Evolution idea, but applies it to
> primary key tables with a different firstRowId semantic. For primary key
> tables, firstRowId is only an internal alignment id for grouping physical
> files. It is not row tracking and should not be exposed as a user-visible
> row id.
> 
> The basic model is:
> 
>    1. Files are aligned by (partition, bucket, firstRowId, rowCount).
>    2. firstRowId is used for file-group alignment.
>    3. rowIndex is used for row alignment inside the file group.
>    4. Read path first performs Data Evolution union read to reconstruct
>    full rows.
>    5. Then the reconstructed rows are converted to KeyValue and passed to
>    the existing MergeTree MOR path.
> 
> With column families, users may define groups such as:
> 'column-family.cols.profile' = 'name,age'
> 'column-family.cols.privacy' = 'email,phone,address'
> 'column-family.cols.metrics' = 'score,level'
> 
> For fixed update scenarios, such as sensitive-column masking, Paimon can
> rewrite only the affected column-family files, remove the old files from
> active metadata after commit, and let snapshot expiration or orphan cleanup
> physically delete them later.
> 
> 
> Looking forward to your feedback.
> 
> [1]:
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables
> 
> Best,
> Aitozi.
>
