corasaurus-hex opened a new issue, #8850: URL: https://github.com/apache/arrow-rs/issues/8850
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I am creating new record batches from a stream of record batches. I need to ensure that identical values in a sorted column are always located within the same record batch. `BatchCoalescer` isn't a good fit for this, but I would like to take advantage of the optimized machinery within `BatchCoalescer` to accomplish it. The problem is that nothing outside of that crate has access to that machinery.

**Describe the solution you'd like**

I would like something akin to `BatchCoalescer` that gives me control over when buffered records are emitted as a batch (and perhaps how many of the buffered records go into that batch). It must not require me to know the largest record batch size up front.

**Describe alternatives you've considered**

I created a wrapper around `BatchCoalescer`. I wanted to set an impossibly large target batch size, but that size is used to allocate the column arrays up front, so using something like `usize::MAX` isn't possible. Instead I have to guess the maximum batch size, with the trade-off that I don't want it to be so large that it uses too much memory. If I run out of space, I create a new `BatchCoalescer` with a larger capacity and re-push the data held in the previous `BatchCoalescer` into the new one before pushing my new records into it.

Alternatively, I could skip the optimized machinery entirely, slice up the record batches myself, and then concatenate them myself as well.
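To make the requested behavior concrete, here is a minimal std-only sketch of the key-aligned re-batching described above. It is not arrow-rs code: `KeyAlignedCoalescer`, `Row`, `min_rows`, and every method name are hypothetical, and a "batch" is modeled as a plain `Vec` of `(sort_key, payload)` rows rather than a `RecordBatch`. It assumes the input is sorted by key across batches. In a real implementation the cut points would presumably be applied with `RecordBatch::slice` and the pieces joined with `concat_batches`, which is the hand-rolled alternative mentioned above.

```rust
/// Hypothetical row type standing in for one row of a record batch:
/// (value of the sorted column, rest of the row).
pub type Row = (i64, String);

/// Hypothetical coalescer that only cuts batches where the sort key changes.
/// No maximum batch size has to be known up front; the buffer grows as needed.
pub struct KeyAlignedCoalescer {
    min_rows: usize,
    buffered: Vec<Row>,
    completed: Vec<Vec<Row>>,
}

impl KeyAlignedCoalescer {
    /// `min_rows` is the smallest batch we try to emit (clamped to at least 1).
    pub fn new(min_rows: usize) -> Self {
        Self {
            min_rows: min_rows.max(1),
            buffered: Vec::new(),
            completed: Vec::new(),
        }
    }

    /// Push one input batch. Rows must be sorted by key, and must not sort
    /// before rows pushed earlier.
    pub fn push_batch(&mut self, batch: Vec<Row>) {
        self.buffered.extend(batch);
        self.flush_ready();
    }

    /// Emit every batch that is guaranteed complete.
    fn flush_ready(&mut self) {
        while let Some(last_key) = self.buffered.last().map(|row| row.0) {
            // The run of `last_key` may continue in the next input batch,
            // so only rows before that run are safe to emit.
            let safe_end = self.buffered.partition_point(|row| row.0 < last_key);
            if safe_end < self.min_rows {
                break;
            }
            // Cut at the end of the key run that contains row `min_rows - 1`,
            // so a run of identical keys is never split.
            let boundary_key = self.buffered[self.min_rows - 1].0;
            let cut = self.buffered.partition_point(|row| row.0 <= boundary_key);
            let batch: Vec<Row> = self.buffered.drain(..cut).collect();
            self.completed.push(batch);
        }
    }

    /// Signal end of input and return all batches, including the remainder.
    pub fn finish(mut self) -> Vec<Vec<Row>> {
        if !self.buffered.is_empty() {
            let rest = std::mem::take(&mut self.buffered);
            self.completed.push(rest);
        }
        self.completed
    }
}
```

The design choice here is that the caller-visible policy (where to cut) is separate from the buffering, which is the kind of control the issue asks `BatchCoalescer` to expose. A row-count or byte-size trigger could be swapped in for the `min_rows` check without changing the rest.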
**Additional context**

@timsaucer also recently had a need for a custom `BatchCoalescer`, except he needs it to emit batches when it reaches either a certain number of rows or a specific total size in bytes. It seems like there might be more cases out there where re-batching based on different criteria is needed.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
