TheR1sing3un opened a new pull request, #374:
URL: https://github.com/apache/paimon-rust/pull/374
### Purpose
Linked issue: close #373
Primary-key (PK) tables could return unmerged (duplicate-key) rows in two
ways that share one root cause: split planning and read dispatch both
ignored key-range overlap.
1. `plan_snapshot` bin-packed a bucket's files purely by size, so files
holding versions of the same key could land in different splits; each
split runs its own sort-merge reader and emits its own version.
Reproducible with `source.split.target-size=1b` and three commits of one
key: SELECT returned 3 rows instead of 1.
2. `read_pk` sent splits without level-0 files to the raw (non-merging)
reader, but compacted files on different levels can still overlap in key
range (e.g. files produced by Java/Spark compaction) and then need merging.
Reported by @JingsongLi while reviewing #340; affects `deduplicate` and
`partial-update` on main, and `aggregation` once #340 lands.
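
The first failure mode can be illustrated with a minimal sketch of size-only packing. This is not the actual `plan_snapshot` code: `FileMeta`, its fields, and `pack_by_size` are hypothetical stand-ins, and keys are simplified to plain integers.

```rust
/// Hypothetical stand-in for a data file's metadata (not the paimon-rust type).
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    min_key: i64,
    max_key: i64,
    size: u64,
}

/// Greedy size-only packing: a new split starts as soon as adding the next
/// file would exceed `target`. Key ranges are never consulted.
fn pack_by_size(files: &[FileMeta], target: u64) -> Vec<Vec<FileMeta>> {
    let mut splits = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0u64;
    for f in files {
        if !current.is_empty() && current_size + f.size > target {
            splits.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += f.size;
        current.push(f.clone());
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

With three files that each hold one version of the same key and a target size of 1 byte, this packing yields three splits, so three independent readers would each emit their own version of the key.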
### Changes
- **`table/merge_tree_split_generator.rs` (new)** — port of Java
`MergeTreeSplitGenerator` / `IntervalPartition`:
- `KeyComparator` decodes serialized BinaryRow min/max keys with the
trimmed-PK data types and compares via `datum_cmp`. BinaryRow stores
fields little-endian, so raw byte comparison would order int 256 before
int 1 — decoding is mandatory for correctness.
- `interval_partition` sorts files by decoded `(min_key, max_key)` and
groups transitively overlapping files into sections; sections never
overlap each other.
- `pack_sections` bin-packs whole sections into splits (reusing
`pack_for_ordered`); a section is atomic and never separated.
- Fail-safe: empty/undecodable key ranges collapse the bucket into one
section — losing parallelism, never correctness.
- **`table_scan.rs`** — PK tables route through
`pack_sections(interval_partition(...))` on the non-data-evolution path;
append-only tables keep the existing file-level `split_for_batch`.
- **`table_read.rs`** — `read_pk` dispatch is now overlap-aware
(`split_requires_merge`): any level-0 file or key-overlapping compacted
files → sort-merge reader; only disjoint compacted files keep the raw
fast path.
- **Deletion-vector / first-row fast path** — mirroring Java
`MergeTreeSplitGenerator`: DV tables resolve stale versions through DVs
(and `KeyValueFileReader` rejects DV splits), and first-row tables skip
level-0 files at plan time, so both keep the plain size-based packing and
the level-0-based read dispatch.
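
The section grouping and atomic packing described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `FileMeta` is a hypothetical struct, keys are assumed to be already decoded to `i64`, and the packing threshold is an assumption that may differ from the real `pack_for_ordered`.

```rust
/// Hypothetical file metadata with already-decoded integer keys.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    name: &'static str,
    min_key: i64,
    max_key: i64,
    size: u64,
}

/// Sort by (min_key, max_key) and group transitively overlapping files.
/// `bound` tracks the running maximum key of the currently open section.
fn interval_partition(mut files: Vec<FileMeta>) -> Vec<Vec<FileMeta>> {
    files.sort_by_key(|f| (f.min_key, f.max_key));
    let mut sections = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut bound = i64::MIN;
    for f in files {
        // A file starting strictly after the running bound opens a new section.
        if !current.is_empty() && f.min_key > bound {
            sections.push(std::mem::take(&mut current));
        }
        bound = if current.is_empty() { f.max_key } else { bound.max(f.max_key) };
        current.push(f);
    }
    if !current.is_empty() {
        sections.push(current);
    }
    sections
}

/// Bin-pack whole sections in order; a section is never divided between splits.
fn pack_sections(sections: Vec<Vec<FileMeta>>, target: u64) -> Vec<Vec<FileMeta>> {
    let mut splits = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0u64;
    for section in sections {
        let section_size: u64 = section.iter().map(|f| f.size).sum();
        if !current.is_empty() && current_size + section_size > target {
            splits.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += section_size;
        current.extend(section);
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

Given files with key ranges `[1,5]`, `[4,8]`, `[8,9]` and `[10,12]`, the first three chain into one section through the running bound, and the last forms its own section. Even with a tiny target size, the overlapping section stays in a single split.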
### Tests
- 16 unit tests: comparator ordering (little-endian regression), section
grouping, running-bound chains, atomic packing, overlap detection,
undecodable-key degradation.
- 3 plan-level regression tests (memory FileIO, real write→commit→plan→read):
overlapping files share one split and read back merged; disjoint files keep
split parallelism; append tables keep file-level bin pack.
- 2 DataFusion e2e tests reproducing the reviewer scenario for `deduplicate`
and `partial-update`. All new tests fail without the fix.
- Verified against the Spark-provisioned warehouse
(`make docker-up` + `cargo test -p paimon-datafusion --all-targets`),
which caught and now guards the deletion-vector interaction.
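
The little-endian regression covered by the comparator tests can be illustrated with plain `i32` values. This sketch uses only the standard library and does not model the BinaryRow layout or `datum_cmp`; the two helper functions are hypothetical.

```rust
use std::cmp::Ordering;

/// Lexicographic comparison of the encoded bytes (what a raw comparison does).
fn raw_byte_cmp(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}

/// Decode the little-endian bytes to integers first, then compare.
fn decoded_cmp(a: [u8; 4], b: [u8; 4]) -> Ordering {
    i32::from_le_bytes(a).cmp(&i32::from_le_bytes(b))
}
```

For 256 (`00 01 00 00`) and 1 (`01 00 00 00`), `raw_byte_cmp` orders 256 first because its leading byte is smaller, while `decoded_cmp` orders 1 first.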
### Out of scope
- `first-row` reads never go through `read_pk` (pre-existing routing); the
same overlap consideration for its raw path can be tracked separately.
- Java's `rawConvertible` flag plumbing through `DataSplit`; this PR derives
the same decision read-side from file metadata instead.
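
A rough sketch of how such a read-side decision could be derived from file metadata alone. The `FileMeta` struct and integer keys are assumptions made for illustration; the actual `split_requires_merge` works on decoded BinaryRow keys, and the fail-safe for undecodable keys is not modeled here.

```rust
/// Hypothetical file metadata: LSM level plus already-decoded key range.
#[derive(Debug, Clone)]
struct FileMeta {
    level: u32,
    min_key: i64,
    max_key: i64,
}

/// Returns true when the split needs the sort-merge reader: it contains a
/// level-0 file, or two of its files overlap in key range.
fn split_requires_merge(files: &[FileMeta]) -> bool {
    if files.iter().any(|f| f.level == 0) {
        return true;
    }
    let mut ranges: Vec<(i64, i64)> =
        files.iter().map(|f| (f.min_key, f.max_key)).collect();
    ranges.sort();
    // After sorting by min key, if any two ranges overlap then some adjacent
    // pair overlaps too, so checking neighbors is sufficient.
    ranges.windows(2).any(|w| w[1].0 <= w[0].1)
}
```

Only splits whose files are all compacted and pairwise disjoint return `false` and would keep the raw fast path.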
🤖 Generated with [Claude Code](https://claude.com/claude-code)