TheR1sing3un opened a new pull request, #7873: URL: https://github.com/apache/paimon/pull/7873
## Summary

Lay the protocol-level groundwork for upcoming compaction work in pypaimon by aligning `CommitMessage` with the shape of Java's `CommitMessageImpl` and adding a JSON-safe wire format for cross-process transport.

**No observable behavior change today**: the read / write / commit paths keep producing the same snapshots. This is the foundation for the follow-up PRs that ship the compaction module, append-only compaction, PK LSM compaction, and the Ray distributed executor.

Split from #7771, which originally bundled all of Phase 1-4 into one ~5000-line PR; this is the first of 6 smaller, focused PRs.

## What's in this PR

**Structural changes**

- New `DataIncrement` (write side) and `CompactIncrement` (compaction side) value objects, direct ports of `org.apache.paimon.io.DataIncrement` and `CompactIncrement`. Each holds `(new_files, deleted_files, changelog_files, new_index_files, deleted_index_files)` so future deletion-vector / changelog work has an unambiguous slot.
- `CommitMessage` refactored to `(partition, bucket, total_buckets, data_increment, compact_increment, check_from_snapshot)`. Convenience properties (`new_files`, `compact_before`, `compact_after`, …) preserve read-site ergonomics.
- `FileStoreCommit` emits ADD entries for `compact_after`, DELETE entries for `compact_before`, and auto-selects `commit_kind=COMPACT` when a message carries only compact increments. A dedicated `commit_compact()` helper enforces COMPACT-only semantics with no row-id assignment.
- `FileStoreWrite` / `TableUpdate` construct `CommitMessage` via `DataIncrement` on the existing write path, with no behavior change for current callers.

**DataFileMeta serialization**

- `to_dict` / `from_dict` round-trip with tagged-value encoding for `bytes`, `Decimal`, `datetime`, `date`, `time`, and `Timestamp`, so file metas can ship through JSON-only transports (Ray task payloads later).
- Public `encode_value` / `decode_value` helpers, reused by `CommitMessage.partition` (DATE / DECIMAL / bytes / Timestamp partitions).
- Tolerates manifest-side `BinaryRow` (lazy-decoded) and pyarrow Array-like `null_counts`, so round-tripping a freshly produced file meta doesn't fail.

**CommitMessageSerializer**

- VERSION=1 wire format covering the full `DataIncrement` + `CompactIncrement` shape (including `IndexFileMeta` identity fields). `dv_ranges` / `global_index_meta` will be wired up alongside the deletion-vector phases.

## Test plan

- [x] New `commit_message_serializer_test`: round-trip `CommitMessage` with `DataIncrement` / `CompactIncrement` / index files / non-JSON-native partition tuples (DATE, Decimal, bytes, Timestamp); `IndexFileMeta` round-trip; unknown-version rejection.
- [x] New `file_store_commit_compact_test`: protocol-level coverage of `compact_before` → DELETE entry, `compact_after` → ADD entry, and auto-COMPACT kind selection. Full e2e coverage lands when the compactor lands (PR2+).
- [x] Existing `file_store_commit_test` / `partition_predicate_test` / `table_commit_test` updated to construct `CommitMessage` via `DataIncrement` instead of the legacy `new_files=` kwarg (a signature-change adaptation with no production behavior change).

## Note: no consumers yet

`CompactIncrement` / `commit_compact` / `CommitMessageSerializer` introduced here have no callers in this PR; they are the foundation. The remaining 5 split PRs build on top of it. The full sequence:

1. **This PR**: commit protocol foundation
2. Append-only compaction module + LocalExecutor
3. MergeFunction abstraction (read-path refactor)
4. Levels + UniversalCompaction strategy
5. MergeTree compaction integration (PK e2e)
6. Ray distributed executor

Each follow-up will land separately for incremental review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
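
For readers unfamiliar with tagged-value encoding, the idea behind the `encode_value` / `decode_value` helpers mentioned under "DataFileMeta serialization" can be sketched as below. This is only an illustration under stated assumptions: the function names mirror the PR text, but the `__type__` / `v` tag layout, the base64 choice for `bytes`, and the ISO-8601 strings are hypothetical and may differ from the actual wire format. Paimon's own `Timestamp` type is omitted.

```python
import base64
import json
from datetime import date, datetime, time
from decimal import Decimal

# Hypothetical tag key; the PR description does not state the real layout.
_TAG = "__type__"


def encode_value(value):
    """Return a JSON-safe form of value, tagging types JSON cannot express."""
    if isinstance(value, bytes):
        return {_TAG: "bytes", "v": base64.b64encode(value).decode("ascii")}
    if isinstance(value, Decimal):
        return {_TAG: "decimal", "v": str(value)}
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        return {_TAG: "datetime", "v": value.isoformat()}
    if isinstance(value, date):
        return {_TAG: "date", "v": value.isoformat()}
    if isinstance(value, time):
        return {_TAG: "time", "v": value.isoformat()}
    return value  # str, int, float, bool and None pass through unchanged


def decode_value(value):
    """Invert encode_value; untagged values are returned as-is."""
    if not (isinstance(value, dict) and _TAG in value):
        return value
    kind, raw = value[_TAG], value["v"]
    if kind == "bytes":
        return base64.b64decode(raw)
    if kind == "decimal":
        return Decimal(raw)
    if kind == "datetime":
        return datetime.fromisoformat(raw)
    if kind == "date":
        return date.fromisoformat(raw)
    if kind == "time":
        return time.fromisoformat(raw)
    raise ValueError(f"unknown value tag: {kind!r}")


# Round-trip a partition tuple holding non-JSON-native values.
partition = (date(2024, 1, 31), Decimal("12.50"), b"\x00\xff", "eu")
wire = json.dumps([encode_value(v) for v in partition])
restored = tuple(decode_value(v) for v in json.loads(wire))
assert restored == partition
```

Tagging each non-native value keeps the payload plain JSON while letting the receiver rebuild the original Python type, which is what a JSON-only transport needs.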
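
The commit-side rules stated in the PR description (`compact_before` → DELETE entry, `compact_after` → ADD entry, COMPACT kind when a message carries only compact increments) can be modelled with a minimal sketch. Everything here is a simplification for illustration: file metas are plain strings, only a subset of the fields is modelled, the `compact_before` / `compact_after` field names on `CompactIncrement` are inferred from the convenience properties, the functions `manifest_entries` and `select_commit_kind` are hypothetical names, and the `APPEND` fallback is an assumption rather than something the PR states.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class DataIncrement:
    # Write-side files; strings stand in for DataFileMeta objects.
    new_files: Tuple[str, ...] = ()
    deleted_files: Tuple[str, ...] = ()


@dataclass(frozen=True)
class CompactIncrement:
    # Compaction-side files (field names assumed for this sketch).
    compact_before: Tuple[str, ...] = ()
    compact_after: Tuple[str, ...] = ()


@dataclass(frozen=True)
class CommitMessage:
    partition: Tuple
    bucket: int
    data_increment: DataIncrement = field(default_factory=DataIncrement)
    compact_increment: CompactIncrement = field(default_factory=CompactIncrement)

    # Convenience properties so read sites need not reach into the increments.
    @property
    def new_files(self):
        return self.data_increment.new_files

    @property
    def compact_before(self):
        return self.compact_increment.compact_before

    @property
    def compact_after(self):
        return self.compact_increment.compact_after


def manifest_entries(message: CommitMessage) -> List[Tuple[str, str]]:
    """Hypothetical helper: map one message to (kind, file) manifest entries."""
    entries = [("ADD", f) for f in message.new_files]
    entries += [("DELETE", f) for f in message.compact_before]
    entries += [("ADD", f) for f in message.compact_after]
    return entries


def select_commit_kind(messages: List[CommitMessage]) -> str:
    """Hypothetical helper: COMPACT only if no message carries write-side files."""
    has_data = any(
        m.data_increment.new_files or m.data_increment.deleted_files
        for m in messages
    )
    has_compact = any(m.compact_before or m.compact_after for m in messages)
    return "COMPACT" if has_compact and not has_data else "APPEND"


compact_msg = CommitMessage(
    partition=("2024-01-31",),
    bucket=0,
    compact_increment=CompactIncrement(("a", "b"), ("c",)),
)
assert manifest_entries(compact_msg) == [("DELETE", "a"), ("DELETE", "b"), ("ADD", "c")]
assert select_commit_kind([compact_msg]) == "COMPACT"
```

Keeping write-side and compaction-side files in separate increments is what makes the kind selection a simple emptiness check instead of an inspection of individual files.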
