kazuyukitanimura commented on issue #7858: URL: https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1769584341
Thank you @milenkovicm for stress-testing and the improvement ideas. Here are my quick thoughts.

For 1 & 2, I think we can break `emit` down into smaller chunks even before `sort_batch()`: we can flush sorted chunks to disk before starting to sort the next group of unsorted chunks. That way we can reduce (and control) the memory footprint of sorting; we would still need a sorting budget, just a smaller one. This is somewhat a mix of @alamb's two strategies. Per the comment at https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L675 the plan was to break a large chunk down and write the pieces in parallel (maybe with a pool) so that single-threaded disk writing doesn't become a bottleneck. A sketch of the chunked sort-and-spill idea is below.

For 3, if `ResourcesExhausted` happens but `group_values.len()` is shorter than `self.batch_size`, that means even a single batch's worth of data did not fit in memory. In that case we need to stop at that point: the assumption is that the memory configuration can hold at least one batch in memory. Alternatively, `batch_size` can be reduced. I think we still need these checks even if we spill smaller amounts of data at a time (see the second sketch below).

Cheers!
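
To make the 1 & 2 idea concrete, here is a minimal, self-contained sketch in plain Rust (deliberately not using DataFusion's actual `RecordBatch`/spill types; `sort_and_spill_in_chunks` and the text spill format are illustrative assumptions). It sorts one `batch_size` chunk at a time and flushes it to disk before touching the next chunk, so the peak sort memory is bounded by one chunk rather than the whole accumulation:

```rust
use std::fs::File;
use std::io::{BufWriter, Result, Write};

/// Illustrative stand-in for one accumulated group row.
type Row = u64;

/// Sort and spill `rows` in chunks of `batch_size`, so the in-memory sort
/// budget is bounded by one chunk instead of the whole accumulation.
/// Returns the number of sorted runs written to `path`.
fn sort_and_spill_in_chunks(mut rows: Vec<Row>, batch_size: usize, path: &str) -> Result<usize> {
    let mut writer = BufWriter::new(File::create(path)?);
    let mut spilled_chunks = 0;

    while !rows.is_empty() {
        // Take at most one batch's worth of rows off the end...
        let split = rows.len().saturating_sub(batch_size);
        let mut chunk: Vec<Row> = rows.split_off(split);

        // ...sort only that chunk (small, bounded memory)...
        chunk.sort_unstable();

        // ...and flush it to disk before sorting the next chunk.
        for row in &chunk {
            writeln!(writer, "{row}")?;
        }
        writer.flush()?;
        spilled_chunks += 1;
        // `chunk` is dropped here, freeing its memory before the next iteration.
    }
    Ok(spilled_chunks)
}

fn main() -> Result<()> {
    let rows: Vec<Row> = (0..10_000).rev().collect();
    let n = sort_and_spill_in_chunks(rows, 1024, "/tmp/spill_demo.txt")?;
    println!("spilled {n} sorted chunks");
    Ok(())
}
```

Each spilled run is sorted internally but the runs still have to be merge-read afterwards, which is what the existing spill-merge path already does; the parallel-write (pooled) variant mentioned above would hand each chunk to a writer task instead of writing inline.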
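And here is a hypothetical sketch of the check for 3 (the names `on_resources_exhausted`, `ExecError`, and `SpillDecision` are made up for illustration, not the actual `row_hash.rs` code): if the memory pool rejects a reservation while fewer than `batch_size` groups are held, spilling cannot help because even one batch does not fit, so we fail fast instead of looping on spills.

```rust
/// Illustrative stand-in for DataFusion's `DataFusionError::ResourcesExhausted`.
#[derive(Debug)]
enum ExecError {
    ResourcesExhausted(String),
}

#[derive(Debug)]
enum SpillDecision {
    Spill,
}

/// Decide what to do when a memory reservation fails mid-aggregation.
fn on_resources_exhausted(
    group_values_len: usize,
    batch_size: usize,
) -> Result<SpillDecision, ExecError> {
    if group_values_len < batch_size {
        // Less than one batch accumulated and we are already out of memory:
        // the configured budget cannot hold even a single batch, so stop here
        // (the user can raise the memory limit or reduce `batch_size`).
        Err(ExecError::ResourcesExhausted(format!(
            "cannot hold even one batch ({group_values_len} < batch_size {batch_size}); \
             increase the memory limit or reduce `batch_size`"
        )))
    } else {
        // At least one full batch is in memory, so spilling frees real space.
        Ok(SpillDecision::Spill)
    }
}

fn main() {
    assert!(matches!(
        on_resources_exhausted(4096, 1024),
        Ok(SpillDecision::Spill)
    ));
    assert!(on_resources_exhausted(100, 1024).is_err());
    println!("checks behave as expected");
}
```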
