rluvaton commented on code in PR #19287:
URL: https://github.com/apache/datafusion/pull/19287#discussion_r2624991342
##########
datafusion/physical-plan/src/aggregates/row_hash.rs:
##########
@@ -550,26 +569,37 @@ impl GroupedHashAggregateStream {
.collect::<Vec<_>>()
.join(", ");
let name = format!("GroupedHashAggregateStream[{partition}]
({agg_fn_names})");
- let reservation = MemoryConsumer::new(name)
- .with_can_spill(true)
- .register(context.memory_pool());
let group_ordering = GroupOrdering::try_new(&agg.input_order_mode)?;
+ let oom_mode = match group_ordering {
+ GroupOrdering::None => {
+ if agg.mode == AggregateMode::Partial {
+ OutOfMemoryMode::EmitEarly
+ } else {
+ OutOfMemoryMode::Spill
+ }
+ }
+ GroupOrdering::Partial(_) | GroupOrdering::Full(_) =>
OutOfMemoryMode::Spill,
Review Comment:
I don't think you should support the regular spill in group ordering full.
as there are 2 cases where you need spill:
1. too many groups
2. aggregate expression takes too much space and we didn't finish the first
group (as if you have more than 1 group this is case 1)
in the first case, emitting early will solve the problem and you can do that
both in partial and final as you know that you won't get more values for that
group
in the second case, regular spilling is not needed as it involves doing
multi level merge sort which will try to sort a single key which is useless.
instead when ordering is full, you just read them back and merge as you go,
and if you don't have enough memory while doing merging, than we can spill and
continue with the merge that include that new spill file (make sure to bail to
avoid infinite loop of spill and read)
not have to be in this pr.
If I'm not mistaken currently when there is a spill it will wait for all
data from all groups to be received and spill them and then will move to the
merge mode. which for group ordering full is not needed, once we finished the
group we can read from spill files and do the merge on that group only, so it
combine spill and emit early
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]