rok commented on issue #48962:
URL: https://github.com/apache/arrow/issues/48962#issuecomment-3871222436

   Hey @guillaume-rochette-oxb, sorry I'm late to the party.
   
   > I would like to add functionality to dynamically restack/resize a stream of pa.RecordBatch w.r.t. minimums and maximums of rows and bytes.
   
   Where are your record batches coming from? A Parquet reader or some other source? RecordBatch sizes are best controlled at the source, e.g. by setting [row_group_size](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write) at Parquet write time.
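   
   For instance, a minimal sketch of capping sizes at write time (the file name and numbers below are made up for illustration):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({"x": list(range(100_000))})
   # Cap each row group at 10k rows; a row group is the largest unit a
   # reader has to materialize at once, so this bounds batch sizes downstream.
   pq.write_table(table, "data.parquet", row_group_size=10_000)
   ```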
   
   PyArrow already has C++-backed [concat](https://arrow.apache.org/docs/python/generated/pyarrow.concat_batches.html#pyarrow-concat-batches) and [slice](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch.slice) methods for manipulating RecordBatch size, so you could combine those in an application-level utility function to even out your RecordBatch sizes (see the sketch below). I am not yet convinced that PyArrow needs such a utility; it would be useful to know more about your use case.
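   
   A minimal sketch of such an application-level rebatcher, assuming a fixed target row count (the `rebatch` name is hypothetical, not a PyArrow API; a byte bound could be layered on similarly via RecordBatch.nbytes):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def rebatch(batches, target_rows):
       """Yield batches of exactly `target_rows` rows (the last may be shorter).
   
       Hypothetical helper built on pa.concat_batches (which copies the
       inputs into one contiguous batch) and RecordBatch.slice (which
       returns a zero-copy view).
       """
       buffered, buffered_rows = [], 0
       for batch in batches:
           buffered.append(batch)
           buffered_rows += batch.num_rows
           while buffered_rows >= target_rows:
               combined = pa.concat_batches(buffered)
               yield combined.slice(0, target_rows)
               remainder = combined.slice(target_rows)
               buffered = [remainder] if remainder.num_rows else []
               buffered_rows = remainder.num_rows
       if buffered_rows:  # flush the tail
           yield pa.concat_batches(buffered)
   
   # e.g. evening out batches read from a Parquet file:
   for batch in rebatch(pq.ParquetFile("data.parquet").iter_batches(), 4_096):
       assert batch.num_rows <= 4_096
   ```
   
   Keeping the rebatching in a generator like this means only one oversized concatenation is materialized at a time, so peak memory stays close to the target batch size.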

