[GitHub] [arrow-datafusion] Dandandan commented on issue #839: Refactor the hash_aggregate

GitBox Sun, 08 Aug 2021 19:54:09 -0700


Dandandan commented on issue #839:
URL: 
https://github.com/apache/arrow-datafusion/issues/839#issuecomment-894918837



   You are right that the current hash aggregate is quite a bit slower in this 
case than it should be.
   
   @alamb has already done some work to make the hash aggregate faster for 
smaller keys, which already gives a ~2x speedup on a tougher query.
   https://github.com/apache/arrow-datafusion/issues/790
   
   I don't think the slow part is in the code you quoted: the `take` is only 
done once for each input array. The slower part is just below it. That code 
runs for each new input key and each input array and calls e.g. `slice` on it, 
so the per-call cost of `slice` is paid once per key, which adds up to a high 
overhead.
   There are some ideas linked in the issue to deal with that.
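
   To illustrate why per-key work is costly, here is a minimal sketch. It is 
not the DataFusion code: it uses plain slices and `HashMap` instead of Arrow 
arrays, and both function names are hypothetical. The first function builds a 
small temporary buffer per key (a stand-in for `take` + `slice`), while the 
second updates accumulators row by row. The second is just one possible 
alternative and not necessarily what the linked issue proposes.

```rust
use std::collections::HashMap;

/// Per-key approach: for every distinct key, gather that key's rows into a
/// fresh temporary buffer (a stand-in for `take` + `slice`), then aggregate it.
fn sum_per_key_slices(keys: &[u32], values: &[i64]) -> HashMap<u32, i64> {
    // Group row indices by key.
    let mut rows_by_key: HashMap<u32, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        rows_by_key.entry(*key).or_default().push(row);
    }

    let mut sums: HashMap<u32, i64> = HashMap::new();
    for (key, rows) in rows_by_key {
        // One small allocation per key: this fixed cost dominates when
        // most keys have only a handful of rows.
        let group: Vec<i64> = rows.iter().map(|&r| values[r]).collect();
        sums.insert(key, group.iter().sum::<i64>());
    }
    sums
}

/// Single-pass approach: update each key's accumulator row by row,
/// with no per-key temporary buffers.
fn sum_single_pass(keys: &[u32], values: &[i64]) -> HashMap<u32, i64> {
    let mut sums: HashMap<u32, i64> = HashMap::new();
    for (key, value) in keys.iter().zip(values) {
        *sums.entry(*key).or_insert(0) += *value;
    }
    sums
}
```

   Both functions return the same sums. With many distinct keys and few rows 
per key, the first pays a fixed cost per key that the second avoids.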
   
   Other parts of the code currently contribute even more to the runtime, 
such as materializing the final keys/states to an array.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
