alamb commented on issue #17446: URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3347939002
> i tried to disable partial aggregation > > ``` > SET datafusion.execution.skip_partial_aggregation_probe_ratio_threshold=0; > SET datafusion.execution.skip_partial_aggregation_probe_rows_threshold=0; > ``` > > The performance reduced down to 2s, but still significantly slower than polar, i'll need to take a closer look on how their internal execution plan looks like Thanks for this investigation @duongcongtoai > I also notice that AccumulatorState::size takes up to 30% of CPU I think this is an artifact that function of using the (old, per row) [`Accumulator`](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html) trait is quite slow compared to the more optimal (but harder to implement) [`GroupsAccumulator`](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html) trait There is more background information on this topic here: https://datafusion.apache.org/blog/2023/08/05/datafusion_fast_grouping/ So I think to be competitive with Polars we will need to implement a `GroupsAccumulator` for array_agg I actually think this might be straight forward since array_agg isn't really aggregating anything Specifically, I think you might be able to create something relatively simple for the state * Save the group assignments and batches as they come in (array_agg needs to store all the rows) * On emit, form the correct outputs by copying to the output - That might also fix https://github.com/apache/datafusion/issues/16517 Note there is a previous attempt at a GroupsAccumulator in this PR from @lkt - https://github.com/apache/datafusion/pull/10149 FYI @gabotechs as you have previously shown interest in improving array_agg's memory profile -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
