[GitHub] [arrow-datafusion] tustvold commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

via GitHub Tue, 27 Jun 2023 00:23:18 -0700


tustvold commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1608940763


   > The per-group type dispatching overhead is not the major bottleneck
   
   I agree 100%, the issue is the per-group dispatching and state-management 
overheads. This is what destroys performance for high-cardinality grouping and 
is what the proposal aims to address. The fact the proposal also eliminates the 
type-dispatching overheads is a happy accident.
   
   > If we let each Accumulator manages its own state, I guess there will be 
more random memory access.
   
   The output of the accumulators is a columnar representation per aggregator. 
The current approach uses a row format to store intermediate state, converting 
back to this columnar representation on flush. It is unclear to me why this 
would be advantageous, if anything it seems like it would lead to more random 
memory accesses.
   
   > The random memory access and mem stall should be the major bottleneck in 
the case of high cardinality grouping
   
   Assuming relatively cheap aggregators like count or max, I personally would 
expect the hash table logic to be the bottleneck, but perhaps that is what you 
are alluding to here. The cost of updating the aggregators once the groups have 
been identified should be a rounding error if done properly in a columnar 
fashion.
   
   I suspect rather than having discussions about theoretical performance 
characteristics, it is probably worth prototyping and collecting some empirical 
data. I can try to find some time in the coming weeks. Whilst I expect good 
things, even if the performance is only an incremental improvement, I think the 
proposal provides a huge improvement to code quality and maintainability which 
I personally think is extremely important.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] tustvold commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

Reply via email to