Dandandan commented on issue #839: URL: https://github.com/apache/arrow-datafusion/issues/839#issuecomment-894918837
You are right that the current hash aggregate is quite a bit slower in this case than it should be. @alamb has already done some work to make the hash aggregate faster for smaller keys, and it already gives a ~2x speedup on a tougher query: https://github.com/apache/arrow-datafusion/issues/790

I don't think the slow code is in the part you quoted: the `take` is only done once for each input array. The slower part is just below it. It runs for each new input key and each input array and calls e.g. `slice` on it, and repeating that so many times is what makes the overhead high. There are some ideas linked in the issue to deal with that.

Other parts of the code currently contribute even more to the runtime, such as materializing the final keys/states to an array.
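To make the `take` / `slice` call pattern concrete, here is a minimal sketch in plain Rust. It is not the DataFusion code: it uses `Vec` and native slices instead of Arrow arrays, and the names `group_indices`, `take` and `sum_by_key` are hypothetical. Slicing a native Rust slice is cheap, so the sketch only shows how often each step runs, not what it costs on Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical helper: group row indices by key, in first-seen key order.
/// The rows of group `g` are `indices[offsets[g]..offsets[g + 1]]`.
fn group_indices(keys: &[u32]) -> (Vec<u32>, Vec<usize>, Vec<usize>) {
    let mut slot_of: HashMap<u32, usize> = HashMap::new();
    let mut group_keys: Vec<u32> = Vec::new();
    let mut rows_per_group: Vec<Vec<usize>> = Vec::new();

    for (row, &key) in keys.iter().enumerate() {
        // Look up the group's slot, creating a new group on first sight.
        let slot = *slot_of.entry(key).or_insert_with(|| {
            group_keys.push(key);
            rows_per_group.push(Vec::new());
            group_keys.len() - 1
        });
        rows_per_group[slot].push(row);
    }

    // Flatten the per-group row lists into one index vector plus offsets.
    let mut indices = Vec::with_capacity(keys.len());
    let mut offsets = vec![0];
    for rows in &rows_per_group {
        indices.extend_from_slice(rows);
        offsets.push(indices.len());
    }
    (group_keys, indices, offsets)
}

/// Stand-in for a `take` kernel: gather `values` in the order of `indices`.
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Sum `values` per key, following the pattern described in the comment.
fn sum_by_key(keys: &[u32], values: &[i64]) -> Vec<(u32, i64)> {
    let (group_keys, indices, offsets) = group_indices(keys);

    // Runs once per input array, regardless of the number of keys.
    let gathered = take(values, &indices);

    group_keys
        .iter()
        .enumerate()
        .map(|(g, &key)| {
            // Runs once per key (and, with several inputs, per input array).
            let group = &gathered[offsets[g]..offsets[g + 1]];
            (key, group.iter().sum::<i64>())
        })
        .collect()
}
```

With many distinct keys, the loop at the end of `sum_by_key` runs once per key for every aggregated column, while `take` runs only once per column. That is why a fixed per-call cost in the slicing step adds up quickly.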
