Rachelint commented on PR #11943: URL: https://github.com/apache/datafusion/pull/11943#issuecomment-2288564387
> > @2010YOUY01 After checking the code for memory control, I think I got it.
> >
> > * `emit_early_if_necessary` is used in `Partial`
> > * and `spill_previous_if_necessary` is used in the final phases
> >
> > They both serve spilling, and the logic may be like this:
> >
> > * After reaching the memory limit, force the `Partial` to submit batches to `Final` as soon as possible
> > * The `Final` then spills them to disk to avoid OOM
> > * After all batches are submitted to `Final`, the `Final` merges the spilled batches and the in-memory batches to get the final results (in the streaming aggregation way; batches are sorted before spilling).
>
> Thanks, now I have figured out the high-level idea of spilling in aggregation and how `emit` works in its implementation.
>
> However, there is other code that does early emit in aggregation, and I'm still trying to figure out how it works. Do you have any pointers for that? I'm guessing it's used in streaming aggregation or some pushed-down limits.
>
> https://github.com/apache/datafusion/blob/482ef4551a4828825da8deb29d222fa82e1cfaa9/datafusion/physical-plan/src/aggregates/row_hash.rs#L605-L611

Yes, you are right. There are two early-emission cases: one is for spilling, as mentioned above, and the other, in the code linked here, is for streaming.
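
To make the spilling flow described above concrete, here is a minimal toy sketch in Rust. It is not the DataFusion implementation: the `Partial` and `Final` structs, the `max_groups` limit (a stand-in for the memory limit), the in-memory `spilled_runs` (a stand-in for spill files on disk), and the `aggregate` driver are all hypothetical names made up for illustration. The final merge uses a `BTreeMap` for brevity rather than a true streaming merge over sorted runs.

```rust
use std::collections::{BTreeMap, HashMap};

/// Toy `Partial` phase (hypothetical): sums values per key and emits its
/// whole state early once it holds more than `max_groups` groups.
struct Partial {
    max_groups: usize,
    groups: HashMap<String, i64>,
}

impl Partial {
    fn new(max_groups: usize) -> Self {
        Partial { max_groups, groups: HashMap::new() }
    }

    /// Accumulate one row; return an early-emitted batch when over the limit.
    fn push(&mut self, key: &str, value: i64) -> Option<Vec<(String, i64)>> {
        *self.groups.entry(key.to_string()).or_insert(0) += value;
        if self.groups.len() > self.max_groups {
            Some(self.groups.drain().collect())
        } else {
            None
        }
    }

    /// Emit whatever is left once the input is exhausted.
    fn finish(&mut self) -> Vec<(String, i64)> {
        self.groups.drain().collect()
    }
}

/// Toy `Final` phase (hypothetical): merges partial batches and "spills"
/// sorted runs when over the limit. `spilled_runs` stands in for disk files.
struct Final {
    max_groups: usize,
    groups: HashMap<String, i64>,
    spilled_runs: Vec<Vec<(String, i64)>>,
}

impl Final {
    fn new(max_groups: usize) -> Self {
        Final { max_groups, groups: HashMap::new(), spilled_runs: Vec::new() }
    }

    /// Merge one partial batch, then spill the in-memory state if needed.
    fn push_batch(&mut self, batch: Vec<(String, i64)>) {
        for (key, value) in batch {
            *self.groups.entry(key).or_insert(0) += value;
        }
        if self.groups.len() > self.max_groups {
            let mut run: Vec<(String, i64)> = self.groups.drain().collect();
            run.sort(); // runs are sorted by key before spilling
            self.spilled_runs.push(run);
        }
    }

    /// Merge spilled runs and the remaining in-memory state.
    /// Simplification: a BTreeMap replaces a streaming k-way merge.
    fn finish(mut self) -> Vec<(String, i64)> {
        let rest: Vec<(String, i64)> = self.groups.drain().collect();
        self.spilled_runs.push(rest);
        let mut merged: BTreeMap<String, i64> = BTreeMap::new();
        for run in self.spilled_runs {
            for (key, value) in run {
                *merged.entry(key).or_insert(0) += value;
            }
        }
        merged.into_iter().collect()
    }
}

/// Drive both phases over a slice of (key, value) rows.
fn aggregate(rows: &[(&str, i64)], partial_limit: usize, final_limit: usize) -> Vec<(String, i64)> {
    let mut partial = Partial::new(partial_limit);
    let mut fin = Final::new(final_limit);
    for &(key, value) in rows {
        if let Some(batch) = partial.push(key, value) {
            fin.push_batch(batch); // early emit from Partial to Final
        }
    }
    fin.push_batch(partial.finish());
    fin.finish()
}
```

Calling `aggregate(&rows, 2, 2)` on a handful of rows exercises both the early emit and the spill path, and the per-key sums should match what an unlimited in-memory aggregation would produce, since the limits only change when state is flushed, not what is summed.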