shyjsarah opened a new pull request, #350: URL: https://github.com/apache/paimon-rust/pull/350
### Purpose `paimon-rust`'s `MANIFEST_FILE_META_SCHEMA` declares only 9 fields, while the upstream Java `org.apache.paimon.manifest.ManifestFileMeta.SCHEMA` declares 12. The four bucket / level pruning fields added in apache/paimon#5345 (`_MIN_BUCKET`, `_MAX_BUCKET`, `_MIN_LEVEL`, `_MAX_LEVEL`) are absent from both the Avro schema and the Rust struct. Manifests written by `paimon-rust` therefore carry no bucket / level summary. This is not a correctness bug — the Java reader's pruning code in `ManifestsReader.filterManifestFileMeta` guards every branch with `Integer != null`, so missing fields just short-circuit the optimization. But it silently disables three production optimizations on tables originating from `paimon-rust`: - **`specifiedBucket` pruning** — bucket-targeted Spark/Flink scans (`ReadBuilder.withBucket(...)`, bucketed joins, runtime filter pushdown) fall back to reading every manifest instead of filtering by bucket range. - **`levelMinMaxFilter` pruning** — Java's compaction (`CompactAction` / minor compaction) reads extra manifests that could be skipped. - **`onlyReadRealBuckets`** — commit-time `StrictModeChecker` no longer skips virtual / index-only manifests (those with negative bucket numbers). ### Brief change log - Add four `Option<i32>` fields to `ManifestFileMeta` (`crates/paimon/src/spec/manifest_file_meta.rs`) with `#[serde(rename, default, skip_serializing_if = "Option::is_none")]` mirroring the existing `min_row_id` / `max_row_id` pattern, plus getters. - Add a `with_bucket_level_stats(min_bucket, max_bucket, min_level, max_level)` chain method for writers that already have a constructed `ManifestFileMeta`. - Extend `MANIFEST_FILE_META_SCHEMA` with the four `["null", "int"]` Avro fields, inserted between `_SCHEMA_ID` and `_MIN_ROW_ID` to match the Java field order. - Update the hand-written Avro decoder (`crates/paimon/src/spec/avro/manifest_file_meta_decode.rs`) to read the new fields when present (otherwise they fall through the `_ => skip_nullable_field` arm as `None`). - Aggregate min/max bucket and level across manifest entries at write time in `TableCommit::write_manifest_file` (`crates/paimon/src/table/table_commit.rs`) and attach via the new chain method. When the entry list is empty all four stay `None`, mirroring back-compat shape. - Bump `new_with_version` to accept the new positional `Option<i32>` arguments; `new` still defaults them to `None`, so non-writer call sites (tests, `objects_file.rs`) need no churn beyond passing `None` through `new_with_version`. Serializer version is intentionally not bumped — it stays at `2`. Java did the same in apache/paimon#5345 because the fields are nullable with `default: null`, so old and new files coexist. ### Tests Three new tests: - `spec::manifest_list::tests::test_manifest_list_roundtrip_preserves_bucket_level_stats` — writes a manifest list with explicit bucket / level values via `ManifestList::write` and asserts they survive the Avro round-trip through `ManifestList::read`. - `table::table_commit::tests::test_commit_writes_bucket_and_level_stats_into_manifest_list` — drives a real `TableCommit::commit` with messages spanning buckets `[0, 3]` and levels `[0, 2]`, then reads back the manifest list and asserts the aggregates. Exercises the full plumbing from `CommitMessage` through `messages_to_entries` to `write_manifest_file`. - `spec::manifest_list::tests::test_manifest_list_decodes_legacy_without_bucket_level_fields` — fabricates a manifest list written under the pre-5345 Avro schema (no bucket / level fields) and asserts it decodes cleanly with the new getters returning `None`. Pins the back-compat contract. Local verification: - `cargo build -p paimon` — ok - `cargo test -p paimon --lib` — 675 passed, 0 failed - `cargo clippy -p paimon --all-targets -- -D warnings` — clean - `cargo fmt --check` — clean ### API and Format On-disk format: extends `MANIFEST_FILE_META_SCHEMA` with four optional Avro fields. Old `paimon-rust` files decode unchanged (missing fields → `None`); new files are readable by Java (which already has these fields) and by any reader that follows Avro's nullable-union default semantics. Serializer version stays at `2`. Public API: - `ManifestFileMeta` gains four `Option<i32>` getters (`min_bucket`, `max_bucket`, `min_level`, `max_level`) and a `with_bucket_level_stats(...)` chain method. - `ManifestFileMeta::new(...)` signature unchanged — new fields default to `None`. - `ManifestFileMeta::new_with_version(...)` signature gains four `Option<i32>` arguments (positional, after `schema_id`, before `min_row_id`). This is `pub(crate)`; the only external impact is keeping the Avro decoder in sync. ### Documentation Rustdoc comments on the new fields and `with_bucket_level_stats` explain the back-compat semantics (None = stats absent, treat as "no information") and reference apache/paimon#5345 for the upstream change. No prose docs need updating. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
