TheR1sing3un opened a new pull request, #7820:
URL: https://github.com/apache/paimon/pull/7820
## Purpose
`read_paimon()` crashes with `pyarrow.lib.ArrowInvalid` when reading
a primary-key table whose data consists of a single snapshot (all
splits are raw-convertible). The issue is in
`RayDatasource._get_read_task`:
```python
yield pyarrow.Table.from_batches([batch], schema=schema)
```
`schema` comes from `PyarrowFieldParser.from_paimon_schema` and marks
PK columns as `NOT NULL`. The `batch` from the Parquet reader may have
those columns as nullable. `from_batches` does a strict schema equality
check (including the nullable bit) and rejects the mismatch.
This is a pre-existing issue on master. It was never triggered by
existing tests because they all write multiple snapshots (creating
non-raw-convertible splits that go through the merge-read path, which
preserves nullability).
## Linked Issue
Discovered while testing PR #7813 on CI. The pyarrow version in the
Python 3.10 CI container triggers the strict check; newer pyarrow on
local dev machines is more lenient.
## Fix
Replace the strict `from_batches([batch], schema=schema)` with:
```python
table = pyarrow.Table.from_batches([batch])
if table.schema != schema:
    table = table.cast(schema)
yield table
```
`Table.cast(target_schema)` is a zero-copy metadata-only operation for
nullable→not-null diffs. It also handles other type promotions (e.g.
`large_string → string`) that may occur on some Ray versions.
When the schemas already match, the `if` branch is skipped, so the
common path adds only a schema comparison.
## Tests
Added `test_read_paimon_pk_single_snapshot`: a PK table, a single write,
then `read_paimon()`. It verifies that no `ArrowInvalid` is raised on
raw-convertible splits.
All existing `ray_integration_test.py` tests remain green.
## API & Format Impact
None. Pure internal fix in the Ray read task function.
## Documentation Impact
None.
## Generative AI Disclosure
Drafted with Claude Code assistance, reviewed and tested by the author.