kazuyukitanimura commented on issue #7858:
URL: https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1769584341

   Thank you @milenkovicm for stress-testing and the improvement ideas.
   Here are my quick thoughts:
   
   For 1 & 2, I think we can break the `emit` down into smaller chunks even 
before `sort_batch()`: flush each sorted chunk to disk before starting to sort 
the next group of unsorted chunks. That way we can reduce (and control) the 
memory footprint of sorting; a sorting budget is still needed, but a smaller 
one. This is something of a hybrid of @alamb's two strategies. A minimal 
sketch is below.
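
   A minimal sketch of that loop; `emit_chunk`, `sort_chunk`, and 
`write_spill` are hypothetical stand-ins for the real `row_hash.rs` internals, 
not existing APIs:

```rust
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;

/// Drain the accumulated group state in `chunk_rows`-sized pieces,
/// sorting and spilling each piece before touching the next one, so the
/// sort only ever needs memory proportional to a single chunk.
fn sort_and_spill_in_chunks(
    mut emit_chunk: impl FnMut(usize) -> Result<Option<RecordBatch>>,
    mut sort_chunk: impl FnMut(RecordBatch) -> Result<RecordBatch>,
    mut write_spill: impl FnMut(RecordBatch) -> Result<()>,
    chunk_rows: usize,
) -> Result<()> {
    while let Some(unsorted) = emit_chunk(chunk_rows)? {
        let sorted = sort_chunk(unsorted)?; // sort cost bounded by `chunk_rows`
        write_spill(sorted)?;               // flush before the next chunk
    }
    Ok(())
}
```

   Since every spilled chunk is itself sorted, the spill files remain sorted 
runs and can be merged downstream just like one large sorted run.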
   
   Per this comment
   
https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L675
   the plan was to break a large chunk down and write it in parallel (maybe 
with a pool of writers) so that a single-threaded disk write does not become a 
blocker. One way that could look is sketched below.
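
   A hedged sketch of my own, not the planned implementation; the 
channel-based writer thread and the `write` callback are purely illustrative:

```rust
use std::sync::mpsc;
use std::thread;

use arrow::record_batch::RecordBatch;

/// Hand sorted batches to a dedicated writer thread over a small bounded
/// channel, so the aggregation thread is only blocked on disk I/O when
/// the queue is full.
fn spawn_spill_writer(
    mut write: impl FnMut(RecordBatch) -> std::io::Result<()> + Send + 'static,
) -> (
    mpsc::SyncSender<RecordBatch>,
    thread::JoinHandle<std::io::Result<()>>,
) {
    // Bounded queue: back-pressure instead of unbounded buffering.
    let (tx, rx) = mpsc::sync_channel::<RecordBatch>(4);
    let handle = thread::spawn(move || -> std::io::Result<()> {
        for batch in rx {
            write(batch)?; // disk write happens off the hot path
        }
        Ok(())
    });
    (tx, handle)
}
```

   A pool would generalize this to several writer threads (one spill file 
each) at the cost of more open file handles.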
   
   For 3, if `ResourcesExhausted` occurs while `group_values.len()` is still 
smaller than `self.batch_size`, then even a single batch worth of data did not 
fit in memory, and we have to stop at that point: spilling what little we hold 
cannot free enough memory to make progress. The underlying assumption is that 
the memory configuration can hold at least one batch in memory; alternatively, 
`batch_size` can be reduced. I think we still need these checks even if we 
spill smaller amounts of data at a time. A sketch of the guard is below.
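
   A sketch of that guard; `group_values_len` and the `spill` callback are 
illustrative names, not the actual `row_hash.rs` signatures:

```rust
use datafusion_common::{DataFusionError, Result};

/// Called when the memory pool reports `ResourcesExhausted` while growing
/// the reservation. If we have not even accumulated one batch worth of
/// groups, spilling cannot free enough memory to make progress, so the
/// error must be surfaced; otherwise spill and let the caller retry.
fn on_resources_exhausted(
    group_values_len: usize,
    batch_size: usize,
    err: DataFusionError,
    spill: impl FnOnce() -> Result<()>,
) -> Result<()> {
    if group_values_len < batch_size {
        // Less than one batch fits: the fix is a larger memory budget
        // or a smaller `batch_size`, not more spilling.
        return Err(err);
    }
    spill() // at least one full batch accumulated; spilling frees real memory
}
```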
   
   Cheers!

