corasaurus-hex commented on issue #8850:
URL: https://github.com/apache/arrow-rs/issues/8850#issuecomment-3543904483

   Here's what I landed on and it seems to be very fast for my use case: 
https://gist.github.com/corasaurus-hex/1754460af46b84a2a13ca034e86e5676
   
   This seems generic enough that it could power a lot of different solutions, 
even something like `BatchCoalescer`.
   
   > without additional constraints this could result in an unbounded use of 
memory
   
   Without getting too deep into the problem: I **do** have additional 
constraints, imposed by the physical world, that limit this. The input data can 
change over time, but the pathological cases have an upper bound (I'm not 100% 
sure what that upper bound is, but the worst case is ~300k-ish right now and 
that's likely very close to it). I'd like to keep using as little memory as 
possible in the other cases, though, so pre-allocating that amount for every 
batch isn't ideal.
   
   > Another approach I have seen used is to store the same data as a 
Vec<RecordBatch>, and call slice() to split batches where the sort key changes 
(so the batches line up nicely with the same keys)
   
   The code I used above is similar to that, minus `partition` (though I am 
using `partition` elsewhere for a join algorithm, and it's lovely). I'm not 
using `partition` in this case because I believe it would partition the entire 
batch, while in the vast majority of cases I only need to seek to a position 
and read a few records past it to find a boundary.
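   For illustration, the seek-and-scan boundary search described above could be 
sketched like this. This is a plain-Rust sketch over a slice of keys rather 
than an arrow array, and `run_end` is a hypothetical helper name, not something 
from the gist or from arrow-rs:

```rust
/// Find the end (exclusive) of the run of equal keys starting at `start`.
/// Unlike a full `partition` of the batch, this only scans forward from
/// `start` until the key changes. Assumes `start < keys.len()`.
fn run_end<T: PartialEq>(keys: &[T], start: usize) -> usize {
    let mut end = start + 1;
    while end < keys.len() && keys[end] == keys[start] {
        end += 1;
    }
    end
}

fn main() {
    let keys = [1, 1, 1, 2, 2, 3];
    // The run of 1s starting at index 0 ends at index 3; with a
    // RecordBatch this would correspond to `batch.slice(0, 3)`.
    assert_eq!(run_end(&keys, 0), 3);
    assert_eq!(run_end(&keys, 3), 5);
    println!("ok");
}
```

   When runs are long, a gallop or binary search over the tail could replace 
the linear scan, but for "a few records past the position" the linear version 
is simpler and cache-friendly.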


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
