[PR] [python] Add compaction module with Ray distributed executor [paimon]

via GitHub Tue, 05 May 2026 19:25:05 -0700


TheR1sing3un opened a new pull request, #7771:
URL: https://github.com/apache/paimon/pull/7771


   ## Summary
   
   Brings Apache Paimon's compaction story to pypaimon end-to-end:
   
   - **Phase 1**: Extend `CommitMessage` with `compact_before` / 
`compact_after`; `FileStoreCommit` emits ADD + DELETE manifest entries, and a 
new `commit_compact()` helper produces snapshots with `commit_kind=COMPACT`. 
`DataFileMeta` gains JSON-friendly `to_dict` / `from_dict`, and a 
`CommitMessageSerializer` is added for cross-process transport.
   - **Phase 2**: Append-only compaction end-to-end via 
`table.new_compact_job(...).execute()`, plumbed through a `Coordinator → Task → 
Executor → Driver-commit` shape that mirrors Spark `CompactProcedure`. Ships a 
`LocalExecutor` for in-process / test usage.
   - **Phase 3**: Primary-key (merge-tree) compaction. Direct port of `Levels` 
+ `UniversalCompaction` (size-amp / size-ratio / file-num three-stage 
decision). New abstract `MergeFunction` + factory; `DeduplicateMergeFunction` 
migrated, `PartialUpdate` / `Aggregate` / `FirstRow` stubbed so configured 
tables fail loudly with a Phase 6 message.
   - **Phase 4**: `RayExecutor` wires the same `CompactJob` to Ray. Driver 
serializes each `CompactTask` to JSON (table + payload + catalog loader spec), 
workers rebuild their `FileStoreTable` via the catalog and run the rewriter, 
driver collects messages for one atomic commit.
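
The `Coordinator → Task → Executor → Driver-commit` shape described in Phases 2
and 4 can be illustrated with a minimal, self-contained sketch. Every name
below (`Task`, `Message`, `plan`, `rewrite`, `run_local`, `commit`) is a
hypothetical stand-in chosen for illustration, not pypaimon's actual API, and
the "rewrite" step only invents an output file name instead of merging data.
The sketch assumes tasks and messages cross the worker boundary as JSON, as the
summary states for `RayExecutor`.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical unit of work: the files of one partition to merge."""
    partition: str
    files: Tuple[str, ...]


@dataclass(frozen=True)
class Message:
    """Hypothetical commit message: files removed and files added."""
    partition: str
    compact_before: Tuple[str, ...]
    compact_after: Tuple[str, ...]


def plan(files_by_partition: Dict[str, List[str]], min_files: int) -> List[Task]:
    """Coordinator: emit one task per partition that reaches the trigger."""
    return [
        Task(partition, tuple(files))
        for partition, files in sorted(files_by_partition.items())
        if len(files) >= min_files
    ]


def rewrite(task_json: str) -> str:
    """Worker: decode a task, pretend to merge its files, encode a message."""
    raw = json.loads(task_json)
    files = raw["files"]
    merged = f"{raw['partition']}/compacted-{len(files)}.parquet"
    return json.dumps(
        {"partition": raw["partition"], "compact_before": files, "compact_after": [merged]}
    )


def run_local(tasks: List[Task]) -> List[Message]:
    """In-process executor; tasks still round-trip through JSON."""
    messages = []
    for task in tasks:
        payload = json.dumps({"partition": task.partition, "files": list(task.files)})
        raw = json.loads(rewrite(payload))
        messages.append(
            Message(raw["partition"], tuple(raw["compact_before"]), tuple(raw["compact_after"]))
        )
    return messages


def commit(live_files: Set[str], messages: List[Message]) -> Set[str]:
    """Driver: apply every DELETE and ADD in a single step, or fail as a whole."""
    deleted = {f for m in messages for f in m.compact_before}
    added = {f for m in messages for f in m.compact_after}
    if not deleted <= live_files:
        raise ValueError("compact_before references files that are not live")
    return (live_files - deleted) | added


files = {"dt=1": ["dt=1/a", "dt=1/b", "dt=1/c"], "dt=2": ["dt=2/x"]}
live = {f for fs in files.values() for f in fs}
snapshot = commit(live, run_local(plan(files, min_files=2)))
```

Because the executor only maps encoded tasks to encoded messages, swapping
`run_local` for a distributed executor would leave `plan` and `commit`
unchanged in this sketch; the driver remains the only place that commits.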
   
   Each phase landed as a separate commit, each followed by a `*-fixup` commit 
addressing the review findings inline. Eight commits total: a single PR keeps 
the design coherent for review, while the commits stay small enough to walk 
through.
   
   Out of scope (later PRs):
   
   - Phase 5: CLI + `python -m pypaimon.compact.entrypoint` for 
`ray job submit`
   - Phase 6: Real `PartialUpdate` / `Aggregate` / `FirstRow` `MergeFunction` 
bodies
   - Phase 7: Sort compact (zorder/hilbert), Deletion Vectors, Changelog 
producer, streaming Ray-Actor coordinator
   
   Plan / design doc: 
`/Users/lcy/.claude/plans/paimon-compaction-java-spark-python-com-cached-cosmos.md`
 (local).
   
   ## Test plan
   
   - [x] Unit: `Levels` semantics, `UniversalCompaction` three-stage 
decision (size-amp, size-ratio, file-num, force-pick-L0), `MergeFunction` 
registry + Phase-6 stubs, `CompactOptions` validation, 
`CommitMessageSerializer` JSON round-trip, `CompactTask` serde with 
non-JSON-native partitions
   - [x] Append-only e2e: file-count drops, data identity preserved, 
`snapshot.commit_kind == "COMPACT"`, partitioned per-partition messages, 
no-op on below-trigger
   - [x] Primary-key e2e: full-compaction dedup keeps latest, level promotion 
to >0, no-op below trigger
   - [x] Rewriter contracts: does not mutate manifest-owned `DataFileMeta`, 
aborts partial output on failure
   - [x] Ray e2e (skipped if `ray` not installed): 
`ray.init(local_mode=True)` runs Append-only compaction end-to-end through 
real `ray.remote` task dispatch + `ray.get` collection
   - [ ] Integration with a Java-written table (manual; left for follow-up 
before un-drafting)
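
The "serde with non-JSON-native partitions" item above can be pictured with a
small round-trip sketch. It assumes partition values may be dates, timestamps,
decimals, or raw bytes, and tags each value with its type so the decoder can
restore it exactly. The helper names (`encode_value`, `decode_value`,
`dumps_task`, `loads_task`) and the tagging scheme are assumptions for
illustration only; the PR's actual wire format is not shown in this message.

```python
import base64
import json
from datetime import date, datetime
from decimal import Decimal
from typing import Any, Dict, List, Tuple


def encode_value(value: Any) -> Dict[str, Any]:
    """Wrap a partition value in a type tag so JSON can carry it losslessly."""
    # datetime must be checked first because it is a subclass of date.
    if isinstance(value, datetime):
        return {"t": "datetime", "v": value.isoformat()}
    if isinstance(value, date):
        return {"t": "date", "v": value.isoformat()}
    if isinstance(value, Decimal):
        return {"t": "decimal", "v": str(value)}
    if isinstance(value, bytes):
        return {"t": "bytes", "v": base64.b64encode(value).decode("ascii")}
    return {"t": "json", "v": value}  # str, int, float, bool, None


def decode_value(tagged: Dict[str, Any]) -> Any:
    """Invert encode_value using the stored type tag."""
    kind, raw = tagged["t"], tagged["v"]
    if kind == "datetime":
        return datetime.fromisoformat(raw)
    if kind == "date":
        return date.fromisoformat(raw)
    if kind == "decimal":
        return Decimal(raw)
    if kind == "bytes":
        return base64.b64decode(raw)
    return raw


def dumps_task(partition: Dict[str, Any], files: List[str]) -> str:
    """Serialize a hypothetical compact task to a JSON string."""
    tagged = {key: encode_value(value) for key, value in partition.items()}
    return json.dumps({"partition": tagged, "files": files}, sort_keys=True)


def loads_task(payload: str) -> Tuple[Dict[str, Any], List[str]]:
    """Rebuild the partition dict and file list from dumps_task output."""
    raw = json.loads(payload)
    partition = {key: decode_value(value) for key, value in raw["partition"].items()}
    return partition, raw["files"]


partition = {
    "dt": date(2026, 5, 5),
    "price_band": Decimal("19.90"),
    "shard": b"\x00\xff",
    "region": "eu",
}
restored, restored_files = loads_task(dumps_task(partition, ["data-0.parquet"]))
```

A plain `json.dumps(partition)` would raise `TypeError` on the date, decimal,
and bytes values, which is the failure mode such a test guards against.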
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
