Re: [DISCUSS] Support Data Evolution for Primary Key Tables

Aitozi Wed, 17 Jun 2026 08:42:55 -0700

Hi,
    Is there any feedback on this?

Best,
Aitozi


Aitozi <[email protected]> wrote on Sat, Jun 6, 2026 at 20:19:

> Hi community,
>
> I would like to start a discussion about supporting column-level updates
> for primary key tables. [1]
>
> Primary key tables are widely used for upsert workloads, and users often
> need to update only a subset of columns in wide tables. Typical scenarios
> include historical data backfill, profile correction, metric recalculation,
> and GDPR-like masking or removal of sensitive columns.
>
> Today, partial-update can express column-level logical updates, but it
> appends update records and relies on merge-on-read or compaction to produce
> the latest visible value. This may add extra cost to the read path. For
> physical removal scenarios, old sensitive values may also remain in old
> data files until compaction and snapshot expiration.
>
> This proposal aims to support two capabilities:
>
>    1. Physical column-level updates for primary key tables.
>    2. Predefined column families, so columns that are frequently updated
>    together can be stored in dedicated physical files and replaced as a
>    column-family unit.
>
> The proposal reuses the existing Data Evolution idea, but applies it to
> primary key tables with different firstRowId semantics. For primary key
> tables, firstRowId is only an internal alignment id for grouping physical
> files. It is not row tracking and should not be exposed as a user-visible
> row id.
>
> The basic model is:
>
>    1. Files are aligned by (partition, bucket, firstRowId, rowCount).
>    2. firstRowId is used for file-group alignment.
>    3. rowIndex is used for row alignment inside the file group.
>    4. The read path first performs a Data Evolution union read to
>    reconstruct full rows.
>    5. The reconstructed rows are then converted to KeyValue and passed to
>    the existing MergeTree MOR path.
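
The five steps above can be sketched roughly as follows. This is only a minimal illustration, not Paimon code: `ColumnFile`, `union_read`, and `merge_by_key` are hypothetical names, the per-file `sequence` field and the `_seq` column are assumptions made for the example, and the final merge is simplified to "keep the row with the highest sequence number per key" rather than the real MergeTree path.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class ColumnFile:
    """Hypothetical stand-in for a physical file holding a subset of columns."""
    partition: str
    bucket: int
    first_row_id: int              # internal alignment id, not a user-visible row id
    columns: Dict[str, List[Any]]  # column name -> one value per row
    sequence: int                  # assumption: a newer file overrides older columns

    @property
    def row_count(self) -> int:
        return len(next(iter(self.columns.values())))


def union_read(files: List[ColumnFile]) -> List[Dict[str, Any]]:
    """Group files by (partition, bucket, firstRowId, rowCount), then stitch
    the columns of each group back into full rows using rowIndex."""
    groups: Dict[Tuple[str, int, int, int], List[ColumnFile]] = {}
    for f in files:
        key = (f.partition, f.bucket, f.first_row_id, f.row_count)
        groups.setdefault(key, []).append(f)

    rows: List[Dict[str, Any]] = []
    for (_, _, _, row_count), group in groups.items():
        ordered = sorted(group, key=lambda f: f.sequence)
        for row_index in range(row_count):
            row: Dict[str, Any] = {}
            for f in ordered:  # later files overwrite the same column
                for name, values in f.columns.items():
                    row[name] = values[row_index]
            rows.append(row)
    return rows


def merge_by_key(rows: List[Dict[str, Any]], key_col: str, seq_col: str) -> Dict[Any, Dict[str, Any]]:
    """Simplified merge-on-read: keep the row with the highest sequence per key."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        k = row[key_col]
        if k not in latest or row[seq_col] >= latest[k][seq_col]:
            latest[k] = row
    return latest


# Two files form one aligned group (same firstRowId and rowCount);
# a third file is a later upsert for key 1 in a separate group.
files = [
    ColumnFile("p0", 0, 0, {"id": [1, 2], "name": ["a", "b"], "_seq": [1, 1]}, sequence=1),
    ColumnFile("p0", 0, 0, {"email": ["x", "y"]}, sequence=2),
    ColumnFile("p0", 0, 100, {"id": [1], "name": ["a2"], "email": ["x2"], "_seq": [2]}, sequence=3),
]
rows = union_read(files)
merged = merge_by_key(rows, key_col="id", seq_col="_seq")
```

The point of the split is that column stitching happens purely by position inside a file group, while key-based merging only sees already reconstructed full rows.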
>
> With column families, users may define groups such as:
>
>    'column-family.cols.profile' = 'name,age'
>    'column-family.cols.privacy' = 'email,phone,address'
>    'column-family.cols.metrics' = 'score,level'
>
> For fixed update scenarios, such as sensitive-column masking, Paimon can
> rewrite only the affected column-family files, remove the old files from
> active metadata after commit, and let snapshot expiration or orphan cleanup
> physically delete them later.
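
With the column-family options quoted above, deciding which files a masking job has to rewrite might look like the sketch below. The function names, the file names, and the shape of the "active metadata" (a plain dict from family name to file name) are all hypothetical and chosen only for illustration; only the `column-family.cols.` option prefix comes from the proposal text.

```python
from typing import Dict, List, Set, Tuple

PREFIX = "column-family.cols."  # option prefix as written in the proposal


def parse_column_families(options: Dict[str, str]) -> Dict[str, List[str]]:
    """Extract family name -> column list from table options."""
    families: Dict[str, List[str]] = {}
    for key, value in options.items():
        if key.startswith(PREFIX):
            cols = [c.strip() for c in value.split(",") if c.strip()]
            families[key[len(PREFIX):]] = cols
    return families


def families_to_rewrite(families: Dict[str, List[str]], updated: Set[str]) -> Set[str]:
    """A family must be rewritten if it contains any updated column."""
    return {name for name, cols in families.items() if updated & set(cols)}


def plan_commit(active: Dict[str, str], rewritten: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Swap in rewritten files; old files leave active metadata but are only
    physically deleted later (snapshot expiration or orphan cleanup)."""
    new_active = dict(active)
    retired: List[str] = []
    for family, new_file in sorted(rewritten.items()):
        retired.append(active[family])
        new_active[family] = new_file
    return new_active, retired


options = {
    "column-family.cols.profile": "name,age",
    "column-family.cols.privacy": "email,phone,address",
    "column-family.cols.metrics": "score,level",
    "bucket": "4",  # unrelated option, ignored by the parser
}
families = parse_column_families(options)
affected = families_to_rewrite(families, {"email", "phone"})

# Hypothetical file names for one file group.
active = {"profile": "profile-0", "privacy": "privacy-0", "metrics": "metrics-0"}
new_active, retired = plan_commit(active, {f: f + "-1" for f in affected})
```

In this sketch, masking `email` and `phone` touches only the `privacy` family, so the `profile` and `metrics` files stay in place.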
>
>
> Looking forward to your feedback.
>
> [1]:
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-44%3A+Support+Data+Evolution+for+Primary+Key+Tables
>
> Best,
> Aitozi.
>
