[GitHub] [arrow-rs] alamb opened a new issue #343: Add a RecordBatch::split

GitBox Mon, 24 May 2021 09:23:27 -0700


alamb opened a new issue #343:
URL: https://github.com/apache/arrow-rs/issues/343



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Sometimes it is advantageous to split one large `RecordBatch` into smaller 
batches for processing (for example, to process the smaller `RecordBatch`es 
in parallel).
   
   So instead of one `RecordBatch` with 1M rows, we could have 100 
`RecordBatch`es with 10,000 rows each that could be processed in parallel.
   
   @tustvold implemented such a function in 
https://github.com/apache/arrow-datafusion/pull/379/files
   ```
   fn split_batch(sorted: &RecordBatch, batch_size: usize) -> Vec<RecordBatch> {
   ```
   
   **Describe the solution you'd like**
   Port the `split_batch` function into `RecordBatch::split(batch_size)` or 
something similar, and add appropriate tests.
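
   As a rough illustration of the chunking arithmetic such a method would need, here is a minimal sketch that uses only the standard library. The helper name `split_ranges` is hypothetical and is not taken from the linked PR or from the arrow crate; it only computes the `(offset, length)` pair for each piece, on the assumption that each pair would then be used to take a slice of the columns of the original batch.

```rust
/// Compute the `(offset, length)` pair for every chunk produced when
/// `num_rows` rows are split into pieces of at most `batch_size` rows.
///
/// Hypothetical helper for illustration only; not part of the arrow crate.
fn split_ranges(num_rows: usize, batch_size: usize) -> Vec<(usize, usize)> {
    // A zero batch size would never make progress, so reject it up front.
    assert!(batch_size > 0, "batch_size must be non-zero");

    (0..num_rows)
        .step_by(batch_size)
        // The last chunk may be shorter than `batch_size`.
        .map(|offset| (offset, batch_size.min(num_rows - offset)))
        .collect()
}
```

   With the numbers from the example above, `split_ranges(1_000_000, 10_000)` yields 100 ranges of 10,000 rows each. When the row count is not a multiple of `batch_size`, the final range is shorter, and an empty batch yields no ranges at all.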
   
   cc @jorgecarleitao  @nevi-me 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
