shyjsarah opened a new pull request, #350:
URL: https://github.com/apache/paimon-rust/pull/350

   ### Purpose
   
   `paimon-rust`'s `MANIFEST_FILE_META_SCHEMA` declares only 9 fields, while 
the upstream Java `org.apache.paimon.manifest.ManifestFileMeta.SCHEMA` declares 
12. The four bucket / level pruning fields added in apache/paimon#5345 
(`_MIN_BUCKET`, `_MAX_BUCKET`, `_MIN_LEVEL`, `_MAX_LEVEL`) are absent from both 
the Avro schema and the Rust struct. Manifests written by `paimon-rust` 
therefore carry no bucket / level summary.
   
   This is not a correctness bug — the Java reader's pruning code in 
`ManifestsReader.filterManifestFileMeta` guards every branch with `Integer != 
null`, so missing fields just short-circuit the optimization. But it silently 
disables three production optimizations on tables originating from 
`paimon-rust`:
   
   - **`specifiedBucket` pruning** — bucket-targeted Spark/Flink scans 
(`ReadBuilder.withBucket(...)`, bucketed joins, runtime filter pushdown) fall 
back to reading every manifest instead of filtering by bucket range.
   - **`levelMinMaxFilter` pruning** — Java's compaction (`CompactAction` / 
minor compaction) reads extra manifests that could be skipped.
   - **`onlyReadRealBuckets`** — commit-time `StrictModeChecker` no longer 
skips virtual / index-only manifests (those with negative bucket numbers).
   
   ### Brief change log
   
   - Add four `Option<i32>` fields to `ManifestFileMeta` 
(`crates/paimon/src/spec/manifest_file_meta.rs`) with `#[serde(rename, default, 
skip_serializing_if = "Option::is_none")]` mirroring the existing `min_row_id` 
/ `max_row_id` pattern, plus getters.
   - Add a `with_bucket_level_stats(min_bucket, max_bucket, min_level, 
max_level)` chain method for writers that already have a constructed 
`ManifestFileMeta`.
   - Extend `MANIFEST_FILE_META_SCHEMA` with the four `["null", "int"]` Avro 
fields, inserted between `_SCHEMA_ID` and `_MIN_ROW_ID` to match the Java field 
order.
   - Update the hand-written Avro decoder 
(`crates/paimon/src/spec/avro/manifest_file_meta_decode.rs`) to read the new 
fields when present (otherwise they fall through the `_ => skip_nullable_field` 
arm as `None`).
   - Aggregate min/max bucket and level across manifest entries at write time 
in `TableCommit::write_manifest_file` 
(`crates/paimon/src/table/table_commit.rs`) and attach via the new chain 
method. When the entry list is empty all four stay `None`, mirroring 
back-compat shape.
   - Bump `new_with_version` to accept the new positional `Option<i32>` 
arguments; `new` still defaults them to `None`, so non-writer call sites 
(tests, `objects_file.rs`) need no churn beyond passing `None` through 
`new_with_version`.
   
   Serializer version is intentionally not bumped — it stays at `2`. Java did 
the same in apache/paimon#5345 because the fields are nullable with `default: 
null`, so old and new files coexist.
   
   ### Tests
   
   Three new tests:
   
   - 
`spec::manifest_list::tests::test_manifest_list_roundtrip_preserves_bucket_level_stats`
 — writes a manifest list with explicit bucket / level values via 
`ManifestList::write` and asserts they survive the Avro round-trip through 
`ManifestList::read`.
   - 
`table::table_commit::tests::test_commit_writes_bucket_and_level_stats_into_manifest_list`
 — drives a real `TableCommit::commit` with messages spanning buckets `[0, 3]` 
and levels `[0, 2]`, then reads back the manifest list and asserts the 
aggregates. Exercises the full plumbing from `CommitMessage` through 
`messages_to_entries` to `write_manifest_file`.
   - 
`spec::manifest_list::tests::test_manifest_list_decodes_legacy_without_bucket_level_fields`
 — fabricates a manifest list written under the pre-5345 Avro schema (no bucket 
/ level fields) and asserts it decodes cleanly with the new getters returning 
`None`. Pins the back-compat contract.
   
   Local verification:
   
   - `cargo build -p paimon` — ok
   - `cargo test -p paimon --lib` — 675 passed, 0 failed
   - `cargo clippy -p paimon --all-targets -- -D warnings` — clean
   - `cargo fmt --check` — clean
   
   ### API and Format
   
   On-disk format: extends `MANIFEST_FILE_META_SCHEMA` with four optional Avro 
fields. Old `paimon-rust` files decode unchanged (missing fields → `None`); new 
files are readable by Java (which already has these fields) and by any reader 
that follows Avro's nullable-union default semantics. Serializer version stays 
at `2`.
   
   Public API:
   - `ManifestFileMeta` gains four `Option<i32>` getters (`min_bucket`, 
`max_bucket`, `min_level`, `max_level`) and a `with_bucket_level_stats(...)` 
chain method.
   - `ManifestFileMeta::new(...)` signature unchanged — new fields default to 
`None`.
   - `ManifestFileMeta::new_with_version(...)` signature gains four 
`Option<i32>` arguments (positional, after `schema_id`, before `min_row_id`). 
This is `pub(crate)`; the only external impact is keeping the Avro decoder in 
sync.
   
   ### Documentation
   
   Rustdoc comments on the new fields and `with_bucket_level_stats` explain the 
back-compat semantics (None = stats absent, treat as "no information") and 
reference apache/paimon#5345 for the upstream change. No prose docs need 
updating.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to