[GitHub] [arrow-datafusion] mingmwang commented on issue #5547: Improve the performance of COUNT DISTINCT queries for high cardinality groups

via GitHub Fri, 31 Mar 2023 04:52:00 -0700


mingmwang commented on issue #5547:
URL: 
https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1491810117


   SparkSQL's `Expand` approach multiplies the number of rows that flow into 
the AggregateExec, which means more data copying for all of the agg and group 
by columns. DataFusion's approach only expands the grouping columns during the 
evaluation of the group by; the agg columns will not be expanded, so less data 
is copied. It is also possible to implement some specific optimizations, such 
as avoiding recomputing the hash values for duplicated group columns.  
   Maybe we can add another group stream implementation, 
`GroupedExpandHashAggregateStream`, and move the GroupingSet-specific logic 
there.
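
   To make the difference in data copying concrete, here is a minimal Rust 
sketch. It is not DataFusion or Spark code: `masked_key`, `expand_rows`, and 
`expand_group_keys` are hypothetical names, and columns are modeled as plain 
`Vec<i64>` instead of Arrow arrays. The row-level variant copies the agg input 
once per grouping set, while the key-only variant produces just the masked 
group keys and leaves the agg column untouched.

```rust
/// One grouping set: `true` keeps group column i, `false` replaces it with NULL.
type GroupingSet = Vec<bool>;
/// A group key with NULLs for the columns masked out by the grouping set.
type Key = Vec<Option<i64>>;

/// Build the group key of `row` under one grouping set (hypothetical helper).
fn masked_key(group_cols: &[Vec<i64>], set: &[bool], row: usize) -> Key {
    group_cols
        .iter()
        .zip(set)
        .map(|(col, &keep)| if keep { Some(col[row]) } else { None })
        .collect()
}

/// Row-level expand: every row, including its agg input value, is copied
/// once per grouping set before aggregation.
fn expand_rows(
    group_cols: &[Vec<i64>],
    agg_col: &[i64],
    sets: &[GroupingSet],
) -> Vec<(Key, i64)> {
    let mut out = Vec::with_capacity(sets.len() * agg_col.len());
    for set in sets {
        for (row, &value) in agg_col.iter().enumerate() {
            out.push((masked_key(group_cols, set, row), value)); // agg value copied
        }
    }
    out
}

/// Key-only expand: one list of masked keys per grouping set. The agg column
/// is not copied; position `row` in each list refers back to the same input row.
fn expand_group_keys(
    group_cols: &[Vec<i64>],
    num_rows: usize,
    sets: &[GroupingSet],
) -> Vec<Vec<Key>> {
    sets.iter()
        .map(|set| (0..num_rows).map(|row| masked_key(group_cols, set, row)).collect())
        .collect()
}
```

   With two grouping sets over three rows, `expand_rows` materializes six 
(key, value) pairs, whereas `expand_group_keys` materializes six keys and 
reuses the original three agg values by row index.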
   
   I think one reason for the slowness is that DataFusion's aggregation 
framework is more complex: it involves 3 data structures (the hashmap, the 
global agg state vec, and the impacted global id idx vec), a more complex 
control flow, and a less efficient memory access pattern, especially when 
grouping by high cardinality columns.
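
   The three-structure layout can be sketched roughly as follows. This is a 
simplified illustration, not the actual DataFusion implementation: 
`IndexedAggregator`, `GroupState`, and `touched` are hypothetical names, keys 
are plain `i64`, and the only aggregate is a row count. Each input row goes 
through a hashmap lookup, then an indexed access into the state vec, and the 
accumulator update happens in a second pass over the groups touched by the 
batch.

```rust
use std::collections::HashMap;

/// Per-group state kept in the global state vec.
struct GroupState {
    count: u64,
    /// Rows of the current batch that belong to this group.
    rows_in_batch: Vec<usize>,
}

#[derive(Default)]
struct IndexedAggregator {
    /// Structure 1: group key -> index into `states`.
    map: HashMap<i64, usize>,
    /// Structure 2: global aggregate state, one entry per group.
    states: Vec<GroupState>,
}

impl IndexedAggregator {
    fn update_batch(&mut self, keys: &[i64]) {
        // Structure 3: indices of the groups touched by this batch.
        let mut touched: Vec<usize> = Vec::new();

        // Pass 1: route every row to its group via the hashmap.
        for (row, &key) in keys.iter().enumerate() {
            let next = self.states.len();
            let idx = *self.map.entry(key).or_insert(next);
            if idx == next {
                // The key was new, so allocate its state slot.
                self.states.push(GroupState { count: 0, rows_in_batch: Vec::new() });
            }
            let state = &mut self.states[idx];
            if state.rows_in_batch.is_empty() {
                touched.push(idx);
            }
            state.rows_in_batch.push(row);
        }

        // Pass 2: update the accumulator of each touched group.
        for idx in touched {
            let state = &mut self.states[idx];
            state.count += state.rows_in_batch.len() as u64;
            state.rows_in_batch.clear();
        }
    }

    fn count(&self, key: i64) -> Option<u64> {
        self.map.get(&key).map(|&idx| self.states[idx].count)
    }
}
```

   With high cardinality keys, most groups receive only one or two rows per 
batch, so the per-group row lists and the second pass add overhead without 
much batching benefit.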
   
   By comparison, DataBend's aggregation framework and control flow are more 
straightforward: the value in the hashmap is just a memory address, and the agg 
Accumulators update the memory at that address directly. It is more unsafe and 
closer to the C++ way. I think they learned it from ClickHouse.
   


