TheR1sing3un opened a new pull request, #7820:
URL: https://github.com/apache/paimon/pull/7820

   ## Purpose
   
   `read_paimon()` crashes with `pyarrow.lib.ArrowInvalid` when reading
   a primary-key table whose data consists of a single snapshot (all
   splits are raw-convertible). The issue is in
   `RayDatasource._get_read_task`:
   
   ```python
   yield pyarrow.Table.from_batches([batch], schema=schema)
   ```
   
   `schema` comes from `PyarrowFieldParser.from_paimon_schema` and marks
   PK columns as `NOT NULL`. The `batch` from the Parquet reader may have
   those columns as nullable. `from_batches` does a strict schema equality
   check (including the nullable bit) and rejects the mismatch.
   
   This is a pre-existing issue on master. It was never triggered by
   existing tests because they all write multiple snapshots (creating
   non-raw-convertible splits that go through the merge-read path, which
   preserves nullability).
   
   ## Linked Issue
   
   Discovered while testing PR #7813 on CI (Python 3.10 / pyarrow in the
   CI container triggers the strict check; newer pyarrow on local dev
   machines is more lenient).
   
   ## Fix
   
   Replace the strict `from_batches([batch], schema=schema)` with:
   
   ```python
   table = pyarrow.Table.from_batches([batch])
   if table.schema != schema:
       table = table.cast(schema)
   yield table
   ```
   
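   When only the nullable flag differs, `Table.cast(target_schema)` keeps
   the column types unchanged and re-attaches the target schema, so it
   should be cheap. It also handles real type differences (e.g.
   `large_string → string`) that may occur on some Ray versions, although
   those conversions do rewrite the column data.
   
   When schemas already match, the `if` branch is skipped, so the only
   added cost is the schema comparison.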
   
   ## Tests
   
   Added `test_read_paimon_pk_single_snapshot`: PK table + single write +
   `read_paimon()` — verifies no ArrowInvalid on raw-convertible splits.
   
   All existing `ray_integration_test.py` tests remain green.
   
   ## API & Format Impact
   
   None. Pure internal fix in the Ray read task function.
   
   ## Documentation Impact
   
   None.
   
   ## Generative AI Disclosure
   
   Drafted with Claude Code assistance, reviewed and tested by the author.

