corasaurus-hex opened a new issue, #8850:
URL: https://github.com/apache/arrow-rs/issues/8850

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I am creating new record batches from a stream of record batches, and I need 
to ensure that identical values in a sorted column always end up in the same 
output record batch. `BatchCoalescer` isn't a good fit for this as-is, but I 
would like to take advantage of its optimized machinery to accomplish it; the 
problem is that none of that machinery is accessible outside the crate that 
defines it.
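   For concreteness, here is a minimal sketch of the constraint, assuming a 
hypothetical `Int64` sort key column named `"key"` (the column name, type, and 
null-free data are assumptions for illustration): an incoming batch may only be 
cut at a position where the key value changes, so that equal keys never end up 
split across two output batches.

```rust
// Sketch only: find where a batch can safely be cut so that a run of equal
// values in the (already sorted) "key" column is never split in two.
// Nulls are ignored here for brevity.
use arrow::array::{Array, Int64Array};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Largest cut point <= `max_rows` that does not separate rows sharing the
/// same key value, or `None` if every row up to `max_rows` has the same key.
fn safe_cut_point(batch: &RecordBatch, max_rows: usize) -> Result<Option<usize>, ArrowError> {
    let keys = batch
        .column_by_name("key")
        .ok_or_else(|| ArrowError::InvalidArgumentError("missing key column".into()))?
        .as_any()
        .downcast_ref::<Int64Array>()
        .ok_or_else(|| ArrowError::InvalidArgumentError("key is not Int64".into()))?;

    let limit = max_rows.min(batch.num_rows());
    // Walk backwards from the limit to the last position where the key changes.
    for cut in (1..=limit).rev() {
        if cut == batch.num_rows() || keys.value(cut - 1) != keys.value(cut) {
            return Ok(Some(cut));
        }
    }
    Ok(None)
}
```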
   
   
   **Describe the solution you'd like**
   
   I would like something akin to `BatchCoalescer`, but one that gives me 
control over when the buffered rows are emitted as a batch (and perhaps over 
how many of the buffered rows go into that batch). It must not require me to 
know the largest record batch size up front.
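   To make that concrete, one possible shape for such an API is sketched below. 
This is purely hypothetical; none of these names or methods exist in arrow-rs 
today, it only illustrates the kind of control being asked for.

```rust
// Purely hypothetical sketch of the requested control surface; nothing in this
// block exists in arrow-rs today.
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// A coalescer that buffers rows with the optimized machinery but leaves the
/// decision of when (and how much) to emit entirely to the caller.
trait ControlledCoalescer {
    /// Buffer all rows of `batch` without emitting anything.
    fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError>;

    /// Number of rows currently buffered.
    fn buffered_rows(&self) -> usize;

    /// Emit the first `num_rows` buffered rows as one batch, keeping the rest buffered.
    fn emit(&mut self, num_rows: usize) -> Result<RecordBatch, ArrowError>;
}
```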
   
   
   **Describe alternatives you've considered**
   
   I created a wrapper around `BatchCoalescer`. I wanted to set an impossibly 
large target batch size, but that size is used to allocate the column arrays up 
front, so something like `usize::MAX` isn't possible. Instead I have to guess 
at the maximum batch size, with the trade-off that I don't want the guess to be 
so large that it wastes memory. If I run out of space, I create a new 
`BatchCoalescer` with a larger capacity and re-push the data from the previous 
`BatchCoalescer` into the new one before pushing my new records into it.
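   A rough sketch of that workaround, assuming the `BatchCoalescer` API from 
recent arrow-rs releases (`new`, `push_batch`, `finish_buffered_batch`, 
`next_completed_batch`); the initial guess and the doubling strategy are 
arbitrary choices for illustration:

```rust
// Sketch of the workaround: wrap BatchCoalescer, guess a target size, and
// rebuild with a larger target whenever the guess turns out to be too small.
use arrow::compute::BatchCoalescer;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

struct GrowableCoalescer {
    schema: SchemaRef,
    target: usize,
    buffered_rows: usize,
    inner: BatchCoalescer,
}

impl GrowableCoalescer {
    fn new(schema: SchemaRef, initial_target: usize) -> Self {
        Self {
            inner: BatchCoalescer::new(schema.clone(), initial_target),
            schema,
            target: initial_target,
            buffered_rows: 0,
        }
    }

    fn push(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
        // Grow before the guessed target is exceeded so the coalescer never
        // emits a batch earlier than we want it to.
        if self.buffered_rows + batch.num_rows() > self.target {
            self.grow(self.buffered_rows + batch.num_rows())?;
        }
        self.buffered_rows += batch.num_rows();
        self.inner.push_batch(batch)
    }

    fn grow(&mut self, at_least: usize) -> Result<(), ArrowError> {
        while self.target < at_least {
            self.target *= 2;
        }
        let mut old = std::mem::replace(
            &mut self.inner,
            BatchCoalescer::new(self.schema.clone(), self.target),
        );
        // Drain everything from the old coalescer and re-push it into the new one.
        old.finish_buffered_batch()?;
        while let Some(batch) = old.next_completed_batch() {
            self.inner.push_batch(batch)?;
        }
        Ok(())
    }

    /// Flush everything buffered so far.
    fn finish(&mut self) -> Result<Vec<RecordBatch>, ArrowError> {
        self.inner.finish_buffered_batch()?;
        let mut out = Vec::new();
        while let Some(batch) = self.inner.next_completed_batch() {
            out.push(batch);
        }
        self.buffered_rows = 0;
        Ok(out)
    }
}
```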
   
   Alternatively, I could skip the optimized machinery entirely and instead 
slice up the record batches myself and then concat the slices myself as well.
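   A minimal sketch of that manual approach, using only `RecordBatch::slice` 
and `arrow::compute::concat_batches`; how the split point is chosen is left to 
the caller (for example, the boundary search sketched earlier):

```rust
// Sketch of the non-optimized alternative: keep a Vec of slices and
// concatenate them into one batch when the caller decides to emit.
use arrow::compute::concat_batches;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

struct SliceBuffer {
    schema: SchemaRef,
    slices: Vec<RecordBatch>,
}

impl SliceBuffer {
    fn new(schema: SchemaRef) -> Self {
        Self { schema, slices: Vec::new() }
    }

    /// Buffer the first `rows` rows of `batch` and return the leftover tail.
    /// Caller must ensure `rows <= batch.num_rows()`.
    fn push(&mut self, batch: &RecordBatch, rows: usize) -> RecordBatch {
        self.slices.push(batch.slice(0, rows));
        batch.slice(rows, batch.num_rows() - rows)
    }

    /// Concatenate everything buffered so far into a single output batch.
    fn emit(&mut self) -> Result<RecordBatch, ArrowError> {
        let out = concat_batches(&self.schema, self.slices.iter())?;
        self.slices.clear();
        Ok(out)
    }
}
```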
   
   **Additional context**
   
   @timsaucer also recently had a need for a custom `BatchCoalescer`, except he 
needs it to emit batches when it reaches either a certain number of rows or a 
specific total size in bytes. It seems like there may be more cases out there 
where re-batching based on different criteria is needed.
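   For that use case, the emission check could be as small as the sketch below; 
the thresholds are placeholders, and `get_array_memory_size` is the existing 
`RecordBatch` memory-accounting method.

```rust
// Sketch of an emission criterion based on either row count or estimated
// in-memory size; the thresholds are placeholder values.
use arrow::record_batch::RecordBatch;

const MAX_ROWS: usize = 8192;
const MAX_BYTES: usize = 8 * 1024 * 1024;

/// Returns true once the buffered batches should be emitted as one output batch.
fn should_emit(buffered: &[RecordBatch]) -> bool {
    let rows: usize = buffered.iter().map(|b| b.num_rows()).sum();
    let bytes: usize = buffered.iter().map(|b| b.get_array_memory_size()).sum();
    rows >= MAX_ROWS || bytes >= MAX_BYTES
}
```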

