corasaurus-hex commented on issue #8850: URL: https://github.com/apache/arrow-rs/issues/8850#issuecomment-3543904483
Here's what I landed on and it seems to be very fast for my use case: https://gist.github.com/corasaurus-hex/1754460af46b84a2a13ca034e86e5676

This seems generic enough that it could power a lot of different solutions, even something like `BatchCoalescer`.

> without additional constraints this could result in an unbounded use of memory

Without getting too deep into the problem: I **do** have additional constraints, constraints in the physical world, that limit this. The input data can change over time, but the pathological cases have an upper bound (I'm not 100% sure what that upper bound is, but the worst case is ~300k-ish right now and that's likely very close). I'd still like to use as little memory as possible in the other cases, though, so pre-allocating that amount for every batch isn't ideal.

> Another approach I have seen used is to store the same data as a Vec<RecordBatch>, and call slice() to split batches where the sort key changes (so the batches line up nicely with the same keys)

The code I used above is similar to that but without `partition` (though I'm using `partition` elsewhere for a join algorithm, and it's lovely). I'm not using `partition` in this case because I think it would partition the entire batch, while in the vast majority of cases I only need to seek to a position and read a few records past it to find a boundary.
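To make the seek-to-a-boundary idea concrete, here's a minimal sketch (my assumptions, not the gist's actual code): the sort-key column is represented as a plain `Vec<i64>` standing in for an Arrow array, and `next_boundary` is a hypothetical helper that finds the first index past a starting offset where the key changes. Because equal keys are contiguous in a sorted batch, a binary search via `partition_point` touches only O(log n) elements rather than partitioning the whole batch:

```rust
/// Hypothetical helper: given sorted keys and a start offset, return the
/// index of the first row past `start` whose key differs from keys[start].
fn next_boundary(keys: &[i64], start: usize) -> usize {
    let key = keys[start];
    // Equal keys are contiguous because the batch is sorted on this column,
    // so a binary search over the tail finds where the run of `key` ends.
    start + keys[start..].partition_point(|&k| k == key)
}

fn main() {
    let keys = vec![1, 1, 1, 2, 2, 3, 3, 3, 3];
    assert_eq!(next_boundary(&keys, 0), 3); // run of 1s ends at index 3
    assert_eq!(next_boundary(&keys, 3), 5); // run of 2s ends at index 5
    assert_eq!(next_boundary(&keys, 5), 9); // run of 3s runs to the end
    println!("boundaries: 3, 5, 9");
}
```

Once a boundary is found, the batch can be cut with `RecordBatch::slice(start, boundary - start)`, which is zero-copy in arrow-rs.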
