TheR1sing3un opened a new pull request, #7771: URL: https://github.com/apache/paimon/pull/7771
## Summary

Brings Apache Paimon's compaction story to pypaimon end-to-end:

- **Phase 1**: Extend `CommitMessage` with `compact_before` / `compact_after`; `FileStoreCommit` emits ADD + DELETE manifest entries and a new `commit_compact()` helper produces snapshots with `commit_kind=COMPACT`. `DataFileMeta` gains JSON-friendly `to_dict` / `from_dict`, and a new `CommitMessageSerializer` handles cross-process transport.
- **Phase 2**: Append-only compaction end-to-end via `table.new_compact_job(...).execute()`, plumbed through a `Coordinator → Task → Executor → Driver-commit` shape that mirrors Spark `CompactProcedure`. Ships a `LocalExecutor` for in-process / test usage.
- **Phase 3**: Primary-key (merge-tree) compaction. Direct port of `Levels` + `UniversalCompaction` (size-amp / size-ratio / file-num three-stage decision). New abstract `MergeFunction` + factory; `DeduplicateMergeFunction` migrated, `PartialUpdate` / `Aggregate` / `FirstRow` stubbed so configured tables fail loudly with a Phase 6 message.
- **Phase 4**: `RayExecutor` wires the same `CompactJob` to Ray. Driver serializes each `CompactTask` to JSON (table + payload + catalog loader spec), workers rebuild their `FileStoreTable` via the catalog and run the rewriter, driver collects messages for one atomic commit.

Each phase landed as a separate commit, with a follow-up `*-fixup` commit addressing the review findings inline. Eight commits total: a single PR keeps the design coherent for review while commits stay small enough to walk through.

Out of scope (later PRs):

- Phase 5: CLI + `python -m pypaimon.compact.entrypoint` for `ray job submit`
- Phase 6: Real `PartialUpdate` / `Aggregate` / `FirstRow` `MergeFunction` bodies
- Phase 7: Sort compact (zorder/hilbert), Deletion Vectors, Changelog producer, streaming Ray-Actor coordinator

Plan / design doc: `/Users/lcy/.claude/plans/paimon-compaction-java-spark-python-com-cached-cosmos.md` (local).

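
For reviewers unfamiliar with the three-stage decision named in Phase 3, the following is a simplified, self-contained sketch of a universal-compaction pick over sorted-run sizes. It is not the code in this PR: the function names (`pick_runs`, `_extend_by_size_ratio`), the plain list-of-sizes input, and the default thresholds are all illustrative assumptions, and the force-pick-L0 path and level bookkeeping are omitted.

```python
from typing import List, Optional


def _extend_by_size_ratio(run_sizes: List[int], start_count: int,
                          size_ratio_percent: int) -> int:
    """Grow the candidate prefix while the next run is not much larger."""
    count = start_count
    candidate = sum(run_sizes[:count])
    for size in run_sizes[count:]:
        # Stop once the next run exceeds candidate * (1 + ratio/100).
        if candidate * (100 + size_ratio_percent) < size * 100:
            break
        candidate += size
        count += 1
    return count


def pick_runs(run_sizes: List[int],
              max_size_amp_percent: int = 200,   # hypothetical default
              size_ratio_percent: int = 1,       # hypothetical default
              num_run_trigger: int = 5           # hypothetical default
              ) -> Optional[int]:
    """Return how many leading runs to merge, or None for a no-op.

    run_sizes holds sorted-run sizes, newest first; the last entry is
    the oldest run.
    """
    n = len(run_sizes)
    if n < num_run_trigger:
        return None  # below trigger: nothing to do

    # Stage 1 (size-amp): newer runs are large relative to the oldest
    # run, so merge everything.
    if sum(run_sizes[:-1]) * 100 > max_size_amp_percent * run_sizes[-1]:
        return n

    # Stage 2 (size-ratio): merge a prefix of similarly sized runs.
    count = _extend_by_size_ratio(run_sizes, 1, size_ratio_percent)
    if count > 1:
        return count

    # Stage 3 (file-num): too many runs, force enough of them into the pick.
    if n > num_run_trigger:
        forced = n - num_run_trigger + 1
        return _extend_by_size_ratio(run_sizes, forced, size_ratio_percent)
    return None
```

With these assumed thresholds, `pick_runs([10, 10])` is a no-op because it is below the trigger, while five equally sized runs trip the size-amp stage and are merged in full.
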
## Test plan

- [x] Unit: `Levels` semantics, `UniversalCompaction` three-stage decision (size-amp, size-ratio, file-num, force-pick-L0), `MergeFunction` registry + Phase-6 stubs, `CompactOptions` validation, `CommitMessageSerializer` JSON round-trip, `CompactTask` serde with non-JSON-native partitions
- [x] Append-only e2e: file-count drops, data identity preserved, `snapshot.commit_kind == "COMPACT"`, partitioned per-partition messages, no-op on below-trigger
- [x] Primary-key e2e: full-compaction dedup keeps latest, level promotion to >0, no-op below trigger
- [x] Rewriter contracts: does not mutate manifest-owned `DataFileMeta`, aborts partial output on failure
- [x] Ray e2e (skipped if `ray` not installed): `ray.init(local_mode=True)` runs append-only compaction end-to-end through real `ray.remote` task dispatch + `ray.get` collection
- [ ] Integration with a Java-written table (manual; left for follow-up before un-drafting)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
