guillaume-rochette-oxb commented on issue #48962: URL: https://github.com/apache/arrow/issues/48962#issuecomment-3871294989
Hey @rok,

Yes, at runtime we are reading them with `pyarrow.dataset.dataset().to_batches()`. However, we cannot control the `row_group_size` of the Parquet files we are reading from: they come either from BigQuery exports or from other third-party sources. The trouble is that we handle a large variety of datasets, each with its own schema, and they often show a high variance in bytes per row, so it is difficult to come up with a simple set of rules that fits them all. Hence my decision to resize **both** incoming and outgoing batches by **both** byte count and row count. That way, even if the upstream data is not neatly packed, we normalize it and ensure that downstream consumers receive neatly packed batches.

By the way, in the [PR](https://github.com/apache/arrow/pull/48963) we make use of the `pyarrow.RecordBatch.slice()` and `concat_batches()` capabilities :)
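
For context, here is a minimal sketch of that normalization idea, not the actual code from the PR: it re-chunks a stream of `RecordBatch`es with `RecordBatch.slice()` and `pa.concat_batches()` so every emitted batch respects both a row budget and a byte budget. The function name `resize_batches`, the default limits, and the dataset path are illustrative assumptions.

```python
import pyarrow as pa
import pyarrow.dataset as ds


def resize_batches(batches, max_rows=64_000, max_bytes=64 * 1024 * 1024):
    """Re-chunk a stream of RecordBatches so each yielded batch stays under
    both a row-count and a byte-size budget, regardless of upstream layout.
    (Hypothetical helper; parameters and defaults are assumptions.)"""
    buffer, buffered_rows, buffered_bytes = [], 0, 0

    for batch in batches:
        offset = 0
        while offset < batch.num_rows:
            # Estimate the average row width of this batch to translate the
            # remaining byte budget into a number of rows.
            avg_row_bytes = max(batch.nbytes / max(batch.num_rows, 1), 1)
            rows_by_bytes = int((max_bytes - buffered_bytes) / avg_row_bytes)
            # Take as many rows as fit in both budgets, but always at least one
            # so we keep making progress.
            take = max(
                min(batch.num_rows - offset, max_rows - buffered_rows, rows_by_bytes),
                1,
            )
            piece = batch.slice(offset, take)
            buffer.append(piece)
            buffered_rows += take
            buffered_bytes += take * avg_row_bytes  # estimated size of the slice
            offset += take

            if buffered_rows >= max_rows or buffered_bytes >= max_bytes:
                yield pa.concat_batches(buffer)
                buffer, buffered_rows, buffered_bytes = [], 0, 0

    if buffer:
        yield pa.concat_batches(buffer)


# Usage: normalize batches coming from Parquet files with arbitrary row groups.
dataset = ds.dataset("path/to/parquet/")  # hypothetical path
for batch in resize_batches(dataset.to_batches()):
    ...  # downstream receives consistently sized batches
```

The byte accounting above uses an average-row-width estimate rather than `piece.nbytes`, since the byte size of a sliced batch can over-count buffers shared with the parent batch; the PR may handle this differently.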
