rok commented on issue #48962: URL: https://github.com/apache/arrow/issues/48962#issuecomment-3871222436
Hey @guillaume-rochette-oxb, sorry I'm late to the party.

> I would like to add a functionality enabling to dynamically restack/resize a stream of pa.RecordBatch w.r.t. to minimums and maximums of rows and bytes.

Where are your record batches coming from? A Parquet reader or some other source? RecordBatch sizes are best controlled at the source, e.g. by setting [row_group_size](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write) at Parquet write time.

PyArrow already has C++-based [concat](https://arrow.apache.org/docs/python/generated/pyarrow.concat_batches.html#pyarrow-concat-batches) and [slice](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch.slice) methods for manipulating RecordBatch size. If you already have RecordBatches of some size, you could combine those in an application-level utility function to even out your batch sizes, along the lines of the sketch below.

I'm not yet convinced that such a utility needs to live in PyArrow, so it would be useful to know more about your use case.
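A minimal sketch of such an application-level helper, assuming a row-count target only (the `rebatch` name and logic here are illustrative, not an existing PyArrow API; a byte-based limit would need an extra check on `RecordBatch.nbytes`):

```python
import pyarrow as pa

def rebatch(batches, target_rows):
    """Re-chunk a stream of RecordBatches (all sharing one schema) into
    batches of exactly `target_rows` rows; the final batch may be smaller."""
    buffered = []       # batches accumulated but not yet emitted
    buffered_rows = 0   # total row count of `buffered`
    for batch in batches:
        buffered.append(batch)
        buffered_rows += batch.num_rows
        while buffered_rows >= target_rows:
            combined = pa.concat_batches(buffered)
            yield combined.slice(0, target_rows)     # emit a full-sized batch
            remainder = combined.slice(target_rows)  # carry the leftover rows
            buffered = [remainder] if remainder.num_rows else []
            buffered_rows = remainder.num_rows
    if buffered_rows:  # flush whatever is left at the end of the stream
        yield pa.concat_batches(buffered)
```

For example, if the batches come from a Parquet file, something like `rebatch(pq.ParquetFile("data.parquet").iter_batches(), target_rows=10_000)` would even them out regardless of the file's row group layout (note that `iter_batches(batch_size=...)` already caps the maximum batch size at read time; as far as I know it just won't merge small batches across row groups).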
