[PR] [python] Split row chunks that overflow the 2GB per-column limit on read [paimon]

via GitHub Mon, 15 Jun 2026 04:02:30 -0700


TheR1sing3un opened a new pull request, #8243:
URL: https://github.com/apache/paimon/pull/8243


   ### Purpose
   
   Reading a table with a very large STRING/BYTES column can crash with
   `TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array`.
   
   A row chunk (`chunk_size = 65536`) can exceed the 2GB per-column limit of
   `pyarrow.string()` / `pyarrow.binary()`, which use 32-bit offsets. When that
   happens `pyarrow.array()` returns a `ChunkedArray`, and a single `RecordBatch`
   cannot hold a `ChunkedArray`, so `RecordBatch.from_pydict` fails on the
   non-Arrow-native (row-based) read path.
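
For a sense of scale, the arithmetic behind the limit can be sketched as below. This is only an illustration: it assumes the effective ceiling is the largest signed 32-bit offset (2^31 - 1 bytes of value data per array), and the constant names are made up for the example.

```python
# Assumption: 32-bit signed offsets cap one string/binary array at
# 2**31 - 1 bytes of value data.
INT32_MAX = 2**31 - 1

# Row chunk size quoted in the PR description.
CHUNK_ROWS = 65536

# Largest average value size (in bytes) a full chunk can carry
# before a single column crosses the offset limit.
max_avg_bytes = INT32_MAX // CHUNK_ROWS

# Size of the reproduction payload: 2100 rows of 1 MiB each.
repro_bytes = 2100 * 1024 * 1024

print(max_avg_bytes)            # 32767
print(repro_bytes > INT32_MAX)  # True
```

In other words, a full chunk overflows once its values average roughly 32 KiB each, well below what a large STRING/BYTES column can hold.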
   
   Reproduced on current `master` with a single ~2.1GB string column:
   
   ```python
   import pyarrow as pa

   # 2100 rows x 1 MiB each = 2100 MiB, past the 2048 MiB offset limit
   n, blob = 2100, "a" * (1024 * 1024)
   rows = [(i, blob) for i in range(n)]
   schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])
   # Transpose the row tuples into one Python list per column
   pydict = {name: list(col) for name, col in zip(schema.names, zip(*rows))}
   pa.RecordBatch.from_pydict(pydict, schema=schema)
   # TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
   ```
   
   This PR turns the row-to-batch helpers into generators that build each column
   array, detect the overflow (a column coming back as a `ChunkedArray`), and
   recursively split the rows in half so every emitted `RecordBatch` keeps each
   column under the 2GB limit. A single row that still overflows raises a clear
   `ValueError` instead of recursing forever. Both the serial
   (`_arrow_batch_generator`) and parallel (`_read_one_split_to_batches`) read
   paths are updated, and the small-chunk common case still emits exactly one
   batch.
   
   ### Tests
   
   Added `paimon-python/pypaimon/tests/table_read_chunked_overflow_test.py`, which
   patches `pyarrow.array` to simulate auto-chunking past a small threshold (so no
   real 2GB allocation is needed) and asserts:
   
   - an oversized chunk is split into multiple single-`Array` batches with
     data/order preserved, both with and without the `_row_kind` column;
   - a below-threshold chunk still produces a single batch;
   - a single row that still overflows raises `ValueError`;
   - the static `convert_rows_to_arrow_batches` helper splits the same way.
   


