TheR1sing3un opened a new pull request, #8243:
URL: https://github.com/apache/paimon/pull/8243
### Purpose
Reading a table with a very large STRING/BYTES column can crash with
`TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array`.
A row chunk (`chunk_size = 65536`) can exceed the 2GB per-column limit of
`pyarrow.string()` / `pyarrow.binary()`, which use 32-bit offsets; at 65536
rows, an average value size of roughly 32 KiB is enough. When that happens,
`pyarrow.array()` returns a `ChunkedArray`, and a single `RecordBatch` cannot
hold a `ChunkedArray`, so `RecordBatch.from_pydict` fails on the
non-Arrow-native (row-based) read path.
Reproduced on current `master` with a single ~2.1GB string column:
```python
import pyarrow as pa
n, blob = 2100, "a" * (1024 * 1024) # 2100 MiB > 2048 MiB
rows = [(i, blob) for i in range(n)]
schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])
pydict = {name: list(col) for name, col in zip(schema.names, zip(*rows))}
pa.RecordBatch.from_pydict(pydict, schema=schema)
# TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```
This PR turns the row-to-batch helpers into generators that build each column
array, detect the overflow (a column coming back as a `ChunkedArray`), and
recursively split the rows in half so every emitted `RecordBatch` keeps each
column under the 2GB limit. A single row that still overflows raises a clear
`ValueError` instead of recursing forever. Both the serial
(`_arrow_batch_generator`) and parallel (`_read_one_split_to_batches`) read
paths are updated, and the small-chunk common case still emits exactly one
batch.
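
For reviewers who want the shape of the recursion without reading the diff, here is a minimal, library-agnostic sketch. It is not the code in this PR: `split_rows_into_batches`, `try_build`, and `make_size_limited_builder` are hypothetical names, and the size-limited builder only stands in for the real overflow signal (a column coming back from `pyarrow.array()` as a `ChunkedArray`).

```python
from typing import Callable, Iterator, Optional, Sequence, Tuple, TypeVar

Row = Tuple
Batch = TypeVar("Batch")


def split_rows_into_batches(
    rows: Sequence[Row],
    try_build: Callable[[Sequence[Row]], Optional[Batch]],
) -> Iterator[Batch]:
    """Yield batches in row order, halving the input whenever a build overflows.

    `try_build` (hypothetical) returns a batch, or None to signal overflow.
    With pyarrow it could build each column via pa.array(...) and return None
    if any result is a pa.ChunkedArray.
    """
    if not rows:
        return
    batch = try_build(rows)
    if batch is not None:
        # Common case: the whole chunk fits, so exactly one batch is emitted.
        yield batch
        return
    if len(rows) == 1:
        # Cannot split further; fail loudly instead of recursing forever.
        raise ValueError("a single row exceeds the per-column size limit")
    mid = len(rows) // 2
    yield from split_rows_into_batches(rows[:mid], try_build)
    yield from split_rows_into_batches(rows[mid:], try_build)


def make_size_limited_builder(limit_bytes: int):
    """Stand-in builder: 'overflows' when column 1 exceeds limit_bytes in total."""

    def try_build(rows: Sequence[Row]):
        total = sum(len(row[1]) for row in rows)
        return list(rows) if total <= limit_bytes else None

    return try_build


if __name__ == "__main__":
    rows = [(i, "x" * 10) for i in range(8)]  # 80 payload bytes in total
    for batch in split_rows_into_batches(rows, make_size_limited_builder(25)):
        print([row[0] for row in batch])
```

Halving keeps the recursion depth logarithmic in the chunk size, and because the left half is always emitted before the right, row order is preserved across the emitted batches.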
### Tests
Added `paimon-python/pypaimon/tests/table_read_chunked_overflow_test.py`,
which patches `pyarrow.array` to simulate auto-chunking past a small threshold
(so no real 2GB allocation is needed) and asserts:
- an oversized chunk is split into multiple single-`Array` batches with
data/order preserved, both with and without the `_row_kind` column;
- a below-threshold chunk still produces a single batch;
- a single row that still overflows raises `ValueError`;
- the static `convert_rows_to_arrow_batches` helper splits the same way.