alamb commented on issue #956: URL: https://github.com/apache/arrow-datafusion/issues/956#issuecomment-987313904
@ic4y

> I am currently working out ways to solve the performance problem of high-cardinality aggregation. I followed your method and tested it. I found that there is a certain performance improvement, but it is not ideal: only about 10% under high-cardinality aggregation (I think it needs a several-times performance improvement, like Doris's and Trino's performance under high-cardinality aggregation, #1246).

Did you do any performance profiling (using `pprof` for example) to know where the time is being spent in your query? Is it the aggregate updates? Creating the final array? Something else?

There isn't a lot of "low-hanging fruit" left in the GroupByHash implementation. To get "several times" improvements in performance, I think we are likely to have to start special-casing (e.g. for single-column group by vs. multi-column group by, as well as vectorized aggregators for certain column types).
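To make the special-casing idea above concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion's GroupByHash code: the function names `sum_by_multi` and `sum_by_single`, the restriction to `i64` columns, and the choice of a sum aggregate are all assumptions made for illustration. The generic path allocates an owned key per row, while the single-column path hashes the native value and keeps accumulator state in a dense vector indexed by group id.

```rust
use std::collections::HashMap;

/// Generic path (hypothetical): any number of i64 group columns.
/// Each row's key is materialized as an owned Vec<i64>, which costs
/// one allocation per input row.
fn sum_by_multi(group_cols: &[&[i64]], values: &[i64]) -> HashMap<Vec<i64>, i64> {
    let mut sums: HashMap<Vec<i64>, i64> = HashMap::new();
    for row in 0..values.len() {
        let key: Vec<i64> = group_cols.iter().map(|col| col[row]).collect();
        *sums.entry(key).or_insert(0) += values[row];
    }
    sums
}

/// Special-cased path (hypothetical): exactly one i64 group column.
/// The hash key is the native value, and the running sums live in a
/// dense Vec indexed by group id, in first-seen order.
fn sum_by_single(group_col: &[i64], values: &[i64]) -> (Vec<i64>, Vec<i64>) {
    let mut group_ids: HashMap<i64, usize> = HashMap::new();
    let mut keys: Vec<i64> = Vec::new();
    let mut sums: Vec<i64> = Vec::new();
    for (&k, &v) in group_col.iter().zip(values) {
        // Assign a new group id the first time a key is seen.
        let id = *group_ids.entry(k).or_insert_with(|| {
            keys.push(k);
            sums.push(0);
            keys.len() - 1
        });
        sums[id] += v;
    }
    (keys, sums)
}
```

Whether a split like this pays off in a real query is exactly what profiling would need to show; the sketch only illustrates where per-row allocation and key construction can be avoided when the group key is a single primitive column.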