shyjsarah opened a new pull request, #349:
URL: https://github.com/apache/paimon-rust/pull/349

   ### Purpose
   
   A non-partitioned table written via `paimon-rust` (e.g. through `fusion` / 
`pypaimon`) is unreadable by the Java reader. Spark/Flink/Hive crash with 
`BufferUnderflowException` inside `SerializationUtils.deserializeBinaryRow` the 
moment they try to scan the manifest list:
   
   ```
   Caused by: java.nio.BufferUnderflowException
       at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:435)
       at org.apache.paimon.utils.SerializationUtils.deserializeBinaryRow(SerializationUtils.java:85)
       at org.apache.paimon.stats.SimpleStats.fromRow(SimpleStats.java:97)
       at org.apache.paimon.manifest.ManifestFileMetaSerializer.convertFrom(ManifestFileMetaSerializer.java:75)
       ...
   ```
   
   **Repro** — any PK / append table without partitions:
   
   ```sql
   CREATE TABLE test_pk (order_id INT, price STRING, customer STRING, PRIMARY KEY (order_id) NOT ENFORCED);
   INSERT INTO test_pk VALUES (1, 'TEST', 'TEST');   -- written via paimon-rust
   SELECT * FROM test_pk;                            -- read via Spark → crash
   ```
   
   **Root cause** — `BinaryTableStats` was being constructed with `Vec::new()` 
for `min_values` / `max_values` when there were no columns to collect stats 
for. The Java reader uses `SerializationUtils.deserializeBinaryRow`, which 
requires at minimum a 4-byte big-endian arity prefix (an `EMPTY_ROW` serializes 
to 12 bytes: 4-byte arity + 8-byte null bit set). Zero-length input fails at 
the very first `buffer.getInt()`.
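
The byte layout described above can be sketched in a few lines of standalone Rust. This is an illustration only, not the crate's code: `null_bitset_width`, `serialize_empty_row`, and `read_arity` are hypothetical helper names, and the bit-set width formula (8 header bits plus one bit per field, rounded up to 8-byte words) is an assumption about the Java `BinaryRow` layout that matches the 12-byte figure quoted here.

```rust
/// Width in bytes of a BinaryRow's null bit set for `arity` fields.
/// Assumption: 8 header bits plus one bit per field, rounded up to 8-byte words.
fn null_bitset_width(arity: usize) -> usize {
    ((arity + 63 + 8) / 64) * 8
}

/// Serialize an arity-0 row: 4-byte big-endian arity, then the zeroed null bit set.
fn serialize_empty_row() -> Vec<u8> {
    let mut out = 0i32.to_be_bytes().to_vec();
    out.resize(4 + null_bitset_width(0), 0);
    out
}

/// Read the arity prefix, the step that corresponds to the reader's first `getInt()`.
/// Returns an error instead of panicking when fewer than 4 bytes are available.
fn read_arity(bytes: &[u8]) -> Result<i32, String> {
    match bytes.get(..4) {
        Some(p) => Ok(i32::from_be_bytes([p[0], p[1], p[2], p[3]])),
        None => Err(format!(
            "buffer underflow: need 4 bytes, got {}",
            bytes.len()
        )),
    }
}
```

Under these assumptions, `read_arity(&serialize_empty_row())` yields `Ok(0)`, while `read_arity(&[])` returns the underflow error, which is the Rust-side analogue of the `BufferUnderflowException` in the stack trace.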
   
   The single hot path is `compute_partition_stats` in `table_commit.rs`, which 
returns this stats record whenever a non-partitioned table is committed — which 
is why the failure is so easy to hit. The same wrong empty value is repeated in 
several other places (Avro decode fallbacks, test fixtures) and would re-emit 
the same broken bytes if those paths ever round-tripped to disk.
   
   ### Brief change log
   
   - Add `BinaryTableStats::empty()` (`crates/paimon/src/spec/stats.rs`) that 
returns stats backed by the existing `EMPTY_SERIALIZED_ROW` constant (arity-0 
BinaryRow, 12 bytes), with a doc comment explaining the Java-side protocol 
contract.
   - Fix the production write path: `compute_partition_stats` in 
`crates/paimon/src/table/table_commit.rs` now returns 
`BinaryTableStats::empty()` instead of zero-length bytes when there are no 
partition fields or no entries.
   - Fix Avro decode fallbacks so a missing `_PARTITION_STATS` / `key_stats` / 
`value_stats` reconstitutes as protocol-valid empty stats 
(`crates/paimon/src/spec/avro/manifest_file_meta_decode.rs`, 
`crates/paimon/src/spec/avro/manifest_entry_decode.rs`).
   - Migrate test fixtures off the bad pattern 
(`crates/paimon/src/spec/manifest.rs`, 
`crates/paimon/src/table/referenced_files.rs`, 
`crates/paimon/src/table/data_evolution_writer.rs`, 
`crates/paimon/src/table/table_commit.rs` test helpers).
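
For readers who have not opened the diff, the shape of the new constructor might look roughly like the sketch below. The struct here is a simplified stand-in: the field names, the `null_counts` type, and the `new` signature are assumptions made for illustration, and the constant is a local copy of what this description calls `EMPTY_SERIALIZED_ROW`, not the crate's definition.

```rust
/// Local stand-in for the crate's `EMPTY_SERIALIZED_ROW`:
/// 4-byte arity (0) followed by an 8-byte null bit set, all zero.
const EMPTY_SERIALIZED_ROW: [u8; 12] = [0; 12];

/// Simplified stand-in for `BinaryTableStats`; field layout is assumed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct BinaryTableStats {
    min_values: Vec<u8>,
    max_values: Vec<u8>,
    null_counts: Vec<i64>,
}

#[allow(dead_code)]
impl BinaryTableStats {
    /// Construct stats from already-serialized min/max rows.
    fn new(min_values: Vec<u8>, max_values: Vec<u8>, null_counts: Vec<i64>) -> Self {
        Self { min_values, max_values, null_counts }
    }

    /// Stats for "no columns": min/max are serialized arity-0 rows rather than
    /// zero-length buffers, so a reader can still parse the arity prefix.
    fn empty() -> Self {
        Self::new(
            EMPTY_SERIALIZED_ROW.to_vec(),
            EMPTY_SERIALIZED_ROW.to_vec(),
            Vec::new(),
        )
    }
}
```

Centralizing the empty value in one constructor means call sites no longer need to know the byte layout, which is presumably why the fallbacks and fixtures listed above are all migrated to it.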
   
   ### Tests
   
   Three new unit tests, all round-tripping the empty stats through 
`BinaryRow::from_serialized_bytes`, the Rust mirror of Java's 
`SerializationUtils.deserializeBinaryRow`, so a pass here indicates the bytes 
will also parse in the Java reader:
   
   - `spec::stats::tests::empty_stats_carries_arity_prefix_parseable_by_reader` 
— `BinaryTableStats::empty()` produces bytes ≥ 4 bytes long and decodes back to 
an arity=0 BinaryRow.
   - `table::table_commit::tests::compute_partition_stats_no_partition_fields_returns_decodable_empty` — directly reproduces the user-visible case (non-partitioned table with one entry).
   - `table::table_commit::tests::compute_partition_stats_empty_entries_returns_decodable_empty` — covers the other fallback branch (partitioned schema, no entries).
   
   Local verification:
   
   - `cargo build -p paimon` — ok
   - `cargo test -p paimon --lib` — 675 passed, 0 failed
   - `cargo clippy -p paimon --all-targets -- -D warnings` — clean
   - `cargo fmt --check` — clean
   
   ### API and Format
   
   No on-disk schema change. The fix restores conformance with the existing 
binary format that the Java reference implementation expects.
   
   Public API: adds `BinaryTableStats::empty()` constructor. No breaking 
changes — `BinaryTableStats::new(...)` is untouched and remains the way to 
construct stats with real values.
   
   ### Documentation
   
   No documentation changes required. The new `empty()` method carries a 
rustdoc comment explaining when to use it and why `Vec::new()` would break the 
Java reader; that captures the contract at the point where future contributors 
will look.

