guillaume-rochette-oxb commented on issue #48962: URL: https://github.com/apache/arrow/issues/48962#issuecomment-3871294989
Hey @rok,

Yes, at runtime we are reading them with `pyarrow.dataset.dataset().to_batches()`. However, we cannot control the `row_group_size` of the Parquet files we are reading from: they come either from BigQuery exports or from other third-party sources. The trouble is that we handle a large variety of datasets, each with its own schema, and they often show a high variance in bytes per row, so it is difficult to come up with a simple set of rules that fits them all. Hence my decision to resize **both** incoming and outgoing batches by **both** byte count and row count. That way, even if the upstream data is not neatly packed, we normalize it and ensure that downstream consumers receive neatly packed batches.

By the way, in the [PR](https://github.com/apache/arrow/pull/48963) we make use of the `pyarrow.RecordBatch.slice()` and `concat_batches()` capabilities :)
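
For context, here is a minimal sketch of that normalization idea, not the actual code from the PR: it re-chunks a stream of `RecordBatch`es with `RecordBatch.slice()` and `pa.concat_batches()` so every emitted batch respects both a row budget and a byte budget. The function name `resize_batches`, the default limits, and the dataset path are illustrative assumptions.

```python
import pyarrow as pa
import pyarrow.dataset as ds


def resize_batches(batches, max_rows=64_000, max_bytes=64 * 1024 * 1024):
    """Re-chunk a stream of RecordBatches so each yielded batch stays under
    both a row-count and a byte-size budget, regardless of upstream layout.
    (Hypothetical helper; parameters and defaults are assumptions.)"""
    buffer, buffered_rows, buffered_bytes = [], 0, 0

    for batch in batches:
        offset = 0
        while offset < batch.num_rows:
            # Estimate the average row width of this batch to translate the
            # remaining byte budget into a number of rows.
            avg_row_bytes = max(batch.nbytes / max(batch.num_rows, 1), 1)
            rows_by_bytes = int((max_bytes - buffered_bytes) / avg_row_bytes)
            # Take as many rows as fit in both budgets, but always at least one
            # so we keep making progress.
            take = max(
                min(batch.num_rows - offset, max_rows - buffered_rows, rows_by_bytes),
                1,
            )
            piece = batch.slice(offset, take)
            buffer.append(piece)
            buffered_rows += take
            buffered_bytes += take * avg_row_bytes  # estimated size of the slice
            offset += take

            if buffered_rows >= max_rows or buffered_bytes >= max_bytes:
                yield pa.concat_batches(buffer)
                buffer, buffered_rows, buffered_bytes = [], 0, 0

    if buffer:
        yield pa.concat_batches(buffer)


# Usage: normalize batches coming from Parquet files with arbitrary row groups.
dataset = ds.dataset("path/to/parquet/")  # hypothetical path
for batch in resize_batches(dataset.to_batches()):
    ...  # downstream receives consistently sized batches
```

The byte accounting above uses an average-row-width estimate rather than `piece.nbytes`, since the byte size of a sliced batch can over-count buffers shared with the parent batch; the PR may handle this differently.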
