mingmwang commented on issue #5547: URL: https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1491810117
> SparkSQL's `Expand` approach multiplies the number of rows that flow into the AggregateExec, which means extra copies of all the agg and group by columns. DataFusion's approach expands only the grouping columns during the evaluation of the group by; the agg columns are not expanded, so less data is copied. It may also be possible to implement specific optimizations, such as avoiding recalculating the hash values for duplicated group columns. Maybe we can add another group stream implementation, `GroupedExpandHashAggregateStream`, and move the GroupingSet-specific logic there. I think one reason for the slowness is that DataFusion's aggregation framework is more complex: it involves three data structures (the hashmap, the global agg state vec, and the impacted global id idx vec), a more complex control flow, and a less efficient memory access pattern, especially when grouping by high-cardinality columns. By comparison, DataBend's aggregation framework and control flow are more straightforward: the value in the hashmap is just a memory address, and the agg Accumulators update that memory address directly. It is more unsafe and closer to the C++ way. I think they learned it from ClickHouse.
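
To make the data-copy difference concrete, here is a toy sketch in plain Rust. It is not DataFusion or Spark code: `Row`, `Key`, `Mask`, `sum_with_expand`, and `sum_keys_only` are hypothetical names invented for illustration, and a single `SUM` over two grouping columns stands in for a real aggregation. The first function models the `Expand` idea by materializing a full copy of every row per grouping set; the second builds only the group keys per grouping set and reads the aggregate input from the original rows.

```rust
use std::collections::HashMap;

/// One input row: two grouping columns and one value to aggregate.
#[derive(Clone, Debug)]
pub struct Row {
    pub a: &'static str,
    pub b: &'static str,
    pub v: i64,
}

/// A group key; `None` marks a column that is not part of the grouping set.
/// (Assumption: inputs contain no real NULLs, so no grouping ID is needed.)
pub type Key = (Option<&'static str>, Option<&'static str>);

/// A grouping set as a mask: (keep column a, keep column b).
pub type Mask = (bool, bool);

fn key_for(row: &Row, mask: Mask) -> Key {
    (mask.0.then_some(row.a), mask.1.then_some(row.b))
}

/// Expand-style: materialize rows.len() * sets.len() tuples that carry both
/// the group key and a copy of the aggregate input, then aggregate them.
pub fn sum_with_expand(rows: &[Row], sets: &[Mask]) -> HashMap<Key, i64> {
    let mut expanded: Vec<(Key, i64)> = Vec::with_capacity(rows.len() * sets.len());
    for &mask in sets {
        for row in rows {
            expanded.push((key_for(row, mask), row.v)); // `v` is copied once per set
        }
    }
    let mut sums = HashMap::new();
    for (key, v) in expanded {
        *sums.entry(key).or_insert(0) += v;
    }
    sums
}

/// Key-only expansion: compute the key per grouping set on the fly and read
/// `v` straight from the original rows, so no expanded buffer is built.
pub fn sum_keys_only(rows: &[Row], sets: &[Mask]) -> HashMap<Key, i64> {
    let mut sums = HashMap::new();
    for &mask in sets {
        for row in rows {
            *sums.entry(key_for(row, mask)).or_insert(0) += row.v;
        }
    }
    sums
}
```

Both functions return the same map for the same input; the difference is only that `sum_with_expand` allocates an intermediate buffer holding one copy of the aggregate input per grouping set, while `sum_keys_only` does not. A real engine works on columnar batches and has to distinguish masked columns from genuine NULLs, which this sketch deliberately ignores.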
