westonpace commented on issue #11799: URL: https://github.com/apache/arrow/issues/11799#issuecomment-984224525
The `hash_*` functions all take, as their last argument, a uint32 array of group ids, which explains the error you are seeing. However, even if you corrected that, you would hit a second error: `Direct execution of HASH_AGGREGATE functions`. As of 6.0.1, direct execution was deliberately prevented, I think because we didn't want to confuse users who expected something more like a "group by" operation that both computes the group ids and performs the aggregation. The computation of group ids is not currently exposed because it is stateful, and we haven't exposed any stateful kernels.

So, at the moment, I think you may be out of luck with 6.0.1. I believe the only way to use these kernels is through a query plan directly, and that hasn't been documented yet (other than via dplyr). Work is underway to document query plans in C++ and to expose them in Python via Ibis, and there has been some discussion about exposing the hash and grouping kernels directly, since that could be useful and simple for "dataset-in-memory" operations. And, of course, there is the approach that Alenka shared. So I think you will have several options once 7.0.0 releases.

If you're interested in an undocumented C++ approach, I could share a snippet showing how you could use a query plan to accomplish what you want, although I would need to know a bit more about what you are trying to do. Were you intending to group by the string column and return the sums of the int64 column?
