shyjsarah opened a new pull request, #349:
URL: https://github.com/apache/paimon-rust/pull/349
### Purpose
A non-partitioned table written via `paimon-rust` (e.g. through `fusion` /
`pypaimon`) is unreadable by the Java reader. Spark/Flink/Hive crash with
`BufferUnderflowException` inside `SerializationUtils.deserializeBinaryRow` the
moment they try to scan the manifest list:
```
Caused by: java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:435)
at org.apache.paimon.utils.SerializationUtils.deserializeBinaryRow(SerializationUtils.java:85)
at org.apache.paimon.stats.SimpleStats.fromRow(SimpleStats.java:97)
at org.apache.paimon.manifest.ManifestFileMetaSerializer.convertFrom(ManifestFileMetaSerializer.java:75)
...
```
**Repro** — any PK / append table without partitions:
```sql
CREATE TABLE test_pk (order_id INT, price STRING, customer STRING,
  PRIMARY KEY (order_id) NOT ENFORCED);
INSERT INTO test_pk VALUES (1, 'TEST', 'TEST'); -- written via paimon-rust
SELECT * FROM test_pk; -- read via Spark → crash
```
**Root cause** — `BinaryTableStats` was being constructed with `Vec::new()`
for `min_values` / `max_values` when there were no columns to collect stats
for. The Java reader uses `SerializationUtils.deserializeBinaryRow`, which
requires at minimum a 4-byte big-endian arity prefix (an `EMPTY_ROW` serializes
to 12 bytes: 4-byte arity + 8-byte null bit set). Zero-length input fails at
the very first `buffer.getInt()`.
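
The arity-prefix requirement described above can be illustrated with a minimal sketch. It is not the paimon-rust or Java implementation: `empty_serialized_row` and `read_arity` are hypothetical names introduced here, and the layout (4-byte big-endian arity followed by an 8-byte null bit set) is taken from the description in this PR.

```rust
/// Serialized form of an arity-0 row as described in this PR:
/// a 4-byte big-endian arity followed by an 8-byte null bit set (12 bytes).
/// Hypothetical helper, for illustration only.
fn empty_serialized_row() -> Vec<u8> {
    let mut bytes = Vec::with_capacity(12);
    bytes.extend_from_slice(&0i32.to_be_bytes()); // arity = 0
    bytes.extend_from_slice(&[0u8; 8]); // null bit set, all zero
    bytes
}

/// Hypothetical stand-in for the reader's first step: read the arity prefix.
/// A buffer shorter than 4 bytes is rejected, which is the Rust-side
/// analogue of the `BufferUnderflowException` seen on the Java side.
fn read_arity(bytes: &[u8]) -> Result<i32, String> {
    if bytes.len() < 4 {
        return Err(format!(
            "buffer underflow: need 4 bytes for arity, got {}",
            bytes.len()
        ));
    }
    Ok(i32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

Under these assumptions, `read_arity(&empty_serialized_row())` yields `Ok(0)`, while `read_arity(&[])` yields an error, which is the case `Vec::new()` put the Java reader in.
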
The only production path that hit this is `compute_partition_stats` in
`table_commit.rs`, which returned the zero-length stats on every commit to a
non-partitioned table, so the failure was easy to trigger. The same zero-length
value also appeared in several other places (Avro decode fallbacks, test
fixtures) and would have emitted the same invalid bytes if those paths ever
round-tripped to disk.
### Brief change log
- Add `BinaryTableStats::empty()` (`crates/paimon/src/spec/stats.rs`) that
returns stats backed by the existing `EMPTY_SERIALIZED_ROW` constant (arity-0
BinaryRow, 12 bytes), with a doc comment explaining the Java-side protocol
contract.
- Fix the production write path: `compute_partition_stats` in
`crates/paimon/src/table/table_commit.rs` now returns
`BinaryTableStats::empty()` instead of zero-length bytes when there are no
partition fields or no entries.
- Fix Avro decode fallbacks so a missing `_PARTITION_STATS` / `key_stats` /
`value_stats` reconstitutes as protocol-valid empty stats
(`crates/paimon/src/spec/avro/manifest_file_meta_decode.rs`,
`crates/paimon/src/spec/avro/manifest_entry_decode.rs`).
- Migrate test fixtures off the bad pattern
(`crates/paimon/src/spec/manifest.rs`,
`crates/paimon/src/table/referenced_files.rs`,
`crates/paimon/src/table/data_evolution_writer.rs`,
`crates/paimon/src/table/table_commit.rs` test helpers).
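
The shape of the change listed above can be sketched as follows. This is a simplified illustration, not the code in the PR: `StatsSketch`, its fields, `EMPTY_ROW_BYTES`, and `stats_or_empty` are hypothetical stand-ins for `BinaryTableStats`, `EMPTY_SERIALIZED_ROW`, and the Avro decode fallbacks, whose real definitions may differ.

```rust
/// Hypothetical, simplified stand-in for `BinaryTableStats`; the real type
/// in `crates/paimon/src/spec/stats.rs` may have different fields.
#[derive(Debug, Clone, PartialEq)]
struct StatsSketch {
    min_values: Vec<u8>,
    max_values: Vec<u8>,
    null_counts: Vec<i64>,
}

/// Arity-0 row: 4-byte big-endian arity (0) + 8-byte null bit set.
/// Stand-in for the `EMPTY_SERIALIZED_ROW` constant named in this PR.
const EMPTY_ROW_BYTES: [u8; 12] = [0; 12];

impl StatsSketch {
    /// Stats with no columns, backed by protocol-valid empty rows
    /// rather than zero-length byte vectors.
    fn empty() -> Self {
        StatsSketch {
            min_values: EMPTY_ROW_BYTES.to_vec(),
            max_values: EMPTY_ROW_BYTES.to_vec(),
            null_counts: Vec::new(),
        }
    }
}

/// Decode-side fallback: a missing stats field becomes empty stats that a
/// reader expecting the arity prefix can still parse.
fn stats_or_empty(decoded: Option<StatsSketch>) -> StatsSketch {
    decoded.unwrap_or_else(StatsSketch::empty)
}
```

Routing every "no stats" case through one constructor keeps the byte-level contract in a single place, so a write path or decode fallback cannot reintroduce zero-length bytes by accident.
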
### Tests
Three new unit tests, all round-tripping the empty stats through
`BinaryRow::from_serialized_bytes`, the Rust mirror of Java's
`SerializationUtils.deserializeBinaryRow`, so bytes that decode here are
expected to decode on the Java reader as well:
- `spec::stats::tests::empty_stats_carries_arity_prefix_parseable_by_reader`:
`BinaryTableStats::empty()` produces bytes ≥ 4 bytes long and decodes back to
an arity=0 BinaryRow.
- `table::table_commit::tests::compute_partition_stats_no_partition_fields_returns_decodable_empty`:
directly reproduces the user-visible case (non-partitioned table with one
entry).
- `table::table_commit::tests::compute_partition_stats_empty_entries_returns_decodable_empty`:
covers the other fallback branch (partitioned schema, no entries).
Local verification:
- `cargo build -p paimon` — ok
- `cargo test -p paimon --lib` — 675 passed, 0 failed
- `cargo clippy -p paimon --all-targets -- -D warnings` — clean
- `cargo fmt --check` — clean
### API and Format
No on-disk schema change. The fix restores conformance with the existing
binary format that the Java reference implementation expects.
Public API: adds the `BinaryTableStats::empty()` constructor. No breaking
changes: `BinaryTableStats::new(...)` is untouched and remains the way to
construct stats with real values.
### Documentation
No documentation changes required. The new `empty()` method carries a
rustdoc comment explaining when to use it and why `Vec::new()` would break the
Java reader; that captures the contract at the point where future contributors
will look.