[PR] fix: merge primary-key rows correctly when bucket files span multiple splits [paimon-rust]

via GitHub Tue, 09 Jun 2026 23:18:01 -0700


TheR1sing3un opened a new pull request, #374:
URL: https://github.com/apache/paimon-rust/pull/374


   ### Purpose
   
   Linked issue: close #373
   
   PK tables could return unmerged (duplicate-key) rows in two ways. Both share
   one root cause: split planning and read dispatch ignored key-range overlap.
   
   1. `plan_snapshot` bin-packed a bucket's files purely by size, so files
      holding versions of the same key could land in different splits; each
      split runs its own sort-merge reader and emits its own version.
      Reproducible with `source.split.target-size=1b` and three commits of one
      key: SELECT returned 3 rows instead of 1.
   2. `read_pk` sent splits without level-0 files to the raw (non-merging)
      reader, but compacted files on different levels can still overlap on key
      range (e.g. produced by Java/Spark compaction) and need merging.
   
   Reported by @JingsongLi while reviewing #340; affects `deduplicate` and
   `partial-update` on main, and `aggregation` once #340 lands.
   
   ### Changes
   
   - **`table/merge_tree_split_generator.rs` (new)** — port of Java
     `MergeTreeSplitGenerator` / `IntervalPartition`:
     - `KeyComparator` decodes serialized BinaryRow min/max keys with the
       trimmed-PK data types and compares via `datum_cmp`. BinaryRow stores
       fields little-endian, so raw byte comparison would order int 256 before
       int 1 — decoding is mandatory for correctness.
     - `interval_partition` sorts files by decoded `(min_key, max_key)` and
       groups transitively overlapping files into sections; sections never
       overlap each other.
     - `pack_sections` bin-packs whole sections into splits (reusing
       `pack_for_ordered`); a section is atomic and never separated.
     - Fail-safe: empty/undecodable key ranges collapse the bucket into one
       section — losing parallelism, never correctness.
   - **`table_scan.rs`** — PK tables route through
     `pack_sections(interval_partition(...))` on the non-data-evolution path;
     append-only tables keep the existing file-level `split_for_batch`.
   - **`table_read.rs`** — `read_pk` dispatch is now overlap-aware
     (`split_requires_merge`): any level-0 file or key-overlapping compacted
     files → sort-merge reader; only disjoint compacted files keep the raw
     fast path.
   - **Deletion-vector / first-row fast path** — mirroring Java
     `MergeTreeSplitGenerator`: DV tables resolve stale versions through DVs
     (and `KeyValueFileReader` rejects DV splits), and first-row tables skip
      level-0 at plan time, so both keep plain size-based packing and the
      existing read dispatch that checks only for level-0 files.
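
The section grouping and atomic packing described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `FileMeta`, the `file` helper, and the use of already-decoded `i32` keys are assumptions made for the example, and the greedy packer only approximates what `pack_for_ordered` does.

```rust
/// Hypothetical stand-in for data-file metadata; keys are assumed already decoded.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    name: &'static str,
    min_key: i32,
    max_key: i32,
    size: u64,
}

/// Hypothetical convenience constructor for the example.
fn file(name: &'static str, min_key: i32, max_key: i32, size: u64) -> FileMeta {
    FileMeta { name, min_key, max_key, size }
}

/// Sort by (min_key, max_key) and group transitively overlapping files.
/// A running upper bound tracks the largest max_key seen in the current section.
fn interval_partition(mut files: Vec<FileMeta>) -> Vec<Vec<FileMeta>> {
    files.sort_by_key(|f| (f.min_key, f.max_key));
    let mut sections: Vec<Vec<FileMeta>> = Vec::new();
    let mut bound = i32::MIN;
    for f in files {
        let overlaps = !sections.is_empty() && f.min_key <= bound;
        if overlaps {
            bound = bound.max(f.max_key);
            sections.last_mut().expect("checked non-empty").push(f);
        } else {
            bound = f.max_key;
            sections.push(vec![f]);
        }
    }
    sections
}

/// Greedily pack whole sections into splits; a section is never divided,
/// even when it alone exceeds `target_size`.
fn pack_sections(sections: Vec<Vec<FileMeta>>, target_size: u64) -> Vec<Vec<FileMeta>> {
    let mut splits = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0u64;
    for section in sections {
        let section_size: u64 = section.iter().map(|f| f.size).sum();
        if !current.is_empty() && current_size + section_size > target_size {
            splits.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += section_size;
        current.extend(section);
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}

fn main() {
    let files = vec![
        file("a", 1, 5, 10),
        file("b", 4, 9, 10),
        file("c", 9, 12, 10),
        file("d", 20, 30, 10),
    ];
    let sections = interval_partition(files);
    // a, b and c chain together through overlapping ranges; d stands alone.
    assert_eq!(sections.len(), 2);
    // Even with a tiny target size, the overlapping section stays in one split.
    let splits = pack_sections(sections, 1);
    assert_eq!(splits.len(), 2);
    assert_eq!(splits[0].len(), 3);
}
```

The running bound matters for chains: a wide file such as `[1, 10]` keeps later files like `[2, 3]` and `[5, 6]` in the same section even though those two do not overlap each other.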
   
   ### Tests
   
   - 16 unit tests: comparator ordering (little-endian regression), section
     grouping, running-bound chains, atomic packing, overlap detection,
     undecodable-key degradation.
   - 3 plan-level regression tests (memory FileIO, real write→commit→plan→read):
     overlapping files share one split and read back merged; disjoint files keep
     split parallelism; append tables keep file-level bin pack.
   - 2 DataFusion e2e tests reproducing the reviewer scenario for `deduplicate`
     and `partial-update`. All new tests fail without the fix.
   - Verified against the Spark-provisioned warehouse
     (`make docker-up` + `cargo test -p paimon-datafusion --all-targets`),
     which caught and now guards the deletion-vector interaction.
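
The little-endian regression covered by the comparator tests can be illustrated with plain `i32` values. This sketch does not model the real BinaryRow layout (null bits, fixed-width slots); `cmp_raw_le` and `cmp_decoded` are hypothetical names used only to show why byte-wise comparison misorders keys.

```rust
use std::cmp::Ordering;

/// Compare two keys by their raw little-endian bytes (the incorrect approach).
fn cmp_raw_le(a: i32, b: i32) -> Ordering {
    a.to_le_bytes().cmp(&b.to_le_bytes())
}

/// Decode the bytes back into integers first, then compare.
fn cmp_decoded(a_bytes: [u8; 4], b_bytes: [u8; 4]) -> Ordering {
    i32::from_le_bytes(a_bytes).cmp(&i32::from_le_bytes(b_bytes))
}

fn main() {
    // 256 is [0x00, 0x01, 0x00, 0x00] and 1 is [0x01, 0x00, 0x00, 0x00],
    // so a byte-wise comparison places 256 before 1.
    assert_eq!(cmp_raw_le(256, 1), Ordering::Less);
    // Decoding restores numeric order.
    assert_eq!(
        cmp_decoded(256i32.to_le_bytes(), 1i32.to_le_bytes()),
        Ordering::Greater
    );
}
```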
   
   ### Out of scope
   
   - `first-row` reads never go through `read_pk` (pre-existing routing); the
     same overlap consideration for its raw path can be tracked separately.
   - Java's `rawConvertible` flag plumbing through `DataSplit`; this PR derives
     the same decision read-side from file metadata instead.
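
A read-side check derived from file metadata, as the last item describes, might look like the sketch below. It is an assumption-laden illustration: `FileMeta` and its fields are hypothetical, keys are shown as decoded `i32` values, and the function borrows the name `split_requires_merge` without claiming to match its actual signature.

```rust
/// Hypothetical file metadata with decoded key bounds.
struct FileMeta {
    level: u32,
    min_key: i32,
    max_key: i32,
}

/// Returns true when the split needs the sort-merge reader:
/// any level-0 file, or any two files whose key ranges overlap.
fn split_requires_merge(files: &[FileMeta]) -> bool {
    if files.iter().any(|f| f.level == 0) {
        return true;
    }
    let mut ranges: Vec<(i32, i32)> = files.iter().map(|f| (f.min_key, f.max_key)).collect();
    ranges.sort();
    // Once sorted by min key, any overlap implies an overlapping adjacent pair.
    ranges.windows(2).any(|w| w[1].0 <= w[0].1)
}

fn main() {
    let disjoint = [
        FileMeta { level: 1, min_key: 1, max_key: 5 },
        FileMeta { level: 2, min_key: 6, max_key: 9 },
    ];
    let overlapping = [
        FileMeta { level: 1, min_key: 1, max_key: 5 },
        FileMeta { level: 2, min_key: 5, max_key: 9 },
    ];
    let with_level0 = [FileMeta { level: 0, min_key: 1, max_key: 2 }];

    assert!(!split_requires_merge(&disjoint)); // raw fast path is safe
    assert!(split_requires_merge(&overlapping)); // compacted but overlapping
    assert!(split_requires_merge(&with_level0)); // level-0 always merges
}
```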
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


