[GitHub] [arrow-datafusion] alamb commented on issue #956: Make aggregate accumulators storage column-based

GitBox Mon, 06 Dec 2021 14:38:51 -0800


alamb commented on issue #956:
URL: https://github.com/apache/arrow-datafusion/issues/956#issuecomment-987313904



   @ic4y 
   
   > I am currently working out ways to solve the performance problem of high
   > cardinality aggregation. Follow your method and tested it. I found that
   > there is a certain performance improvement, but not ideal enough, only
   > improved by about 10% under high base aggregation (I think it needs several
   > times performance improvement likes doris and trino's performance under the
   > high cardinality aggregation #1246).
   
   Did you do any performance profiling (using `pprof` for example) to know 
where the time is being spent in your query? Is it the aggregate updates? 
Creating the final array? Something else?
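
   As a rough illustration of the question above, here is a minimal sketch
that times the two phases (aggregate updates vs. building the final output)
separately with `std::time::Instant`. It is not DataFusion code: the function
`timed_sum_by_key` and the struct `PhaseTimings` are hypothetical names, and
coarse wall-clock timing like this is only a first check, not a replacement
for a sampling profiler such as `pprof`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Wall-clock time spent in each phase (hypothetical helper type).
pub struct PhaseTimings {
    pub update: Duration,
    pub finalize: Duration,
}

/// Toy SUM(value) GROUP BY key that reports how long each phase took.
/// Assumes `keys` and `values` have the same length.
pub fn timed_sum_by_key(keys: &[u64], values: &[i64]) -> (Vec<(u64, i64)>, PhaseTimings) {
    assert_eq!(keys.len(), values.len());

    // Phase 1: aggregate updates (hash lookup + accumulate per row).
    let start = Instant::now();
    let mut sums: HashMap<u64, i64> = HashMap::new();
    for (&k, &v) in keys.iter().zip(values) {
        *sums.entry(k).or_insert(0) += v;
    }
    let update = start.elapsed();

    // Phase 2: create the final output (sorted here only for determinism).
    let start = Instant::now();
    let mut out: Vec<(u64, i64)> = sums.into_iter().collect();
    out.sort_unstable_by_key(|&(k, _)| k);
    let finalize = start.elapsed();

    (out, PhaseTimings { update, finalize })
}
```

   If one phase clearly dominates, that is the place to look at more closely
with a real profiler.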
   
   There isn't much "low hanging fruit" left in the GroupByHash
implementation. To get "several times" improvements in performance, I think we
are likely to have to start special casing (e.g. single-column group by vs.
multi-column group by, as well as vectorized aggregators for certain column
types).
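
   To make the kind of special casing mentioned above concrete, here is a
minimal sketch in plain Rust (no Arrow or DataFusion types). It contrasts a
hypothetical fast path for a single `u64` group column with a generic
multi-column path that has to build a composite key per row. Both store
accumulator state column-wise (one `Vec` per aggregate, indexed by group), in
the spirit of this issue. All names (`SumCountState`, `group_by_single_u64`,
`group_by_multi`) are assumptions made for illustration and do not reflect how
GroupByHash is actually implemented.

```rust
use std::collections::HashMap;

/// Column-based accumulator state: entry `i` of each Vec belongs to group `i`.
#[derive(Default)]
pub struct SumCountState {
    pub sums: Vec<i64>,
    pub counts: Vec<u64>,
}

impl SumCountState {
    /// Append zeroed state for a newly seen group.
    fn push_group(&mut self) {
        self.sums.push(0);
        self.counts.push(0);
    }

    /// Fold one input value into the state of group `idx`.
    fn update(&mut self, idx: usize, v: i64) {
        self.sums[idx] += v;
        self.counts[idx] += 1;
    }
}

/// Hypothetical fast path: one u64 group column, hashed directly by value,
/// with no per-row key allocation.
pub fn group_by_single_u64(keys: &[u64], values: &[i64]) -> (Vec<u64>, SumCountState) {
    assert_eq!(keys.len(), values.len());
    let mut index: HashMap<u64, usize> = HashMap::new();
    let mut group_keys: Vec<u64> = Vec::new();
    let mut state = SumCountState::default();
    for (&k, &v) in keys.iter().zip(values) {
        let next = group_keys.len();
        let idx = *index.entry(k).or_insert(next);
        if idx == next {
            // First time this key is seen: register a new group.
            group_keys.push(k);
            state.push_group();
        }
        state.update(idx, v);
    }
    (group_keys, state)
}

/// Generic path: any number of u64 group columns. Each row materializes a
/// composite key (a Vec allocation), which is the overhead the fast path avoids.
pub fn group_by_multi(key_cols: &[&[u64]], values: &[i64]) -> (Vec<Vec<u64>>, SumCountState) {
    assert!(key_cols.iter().all(|c| c.len() == values.len()));
    let mut index: HashMap<Vec<u64>, usize> = HashMap::new();
    let mut group_keys: Vec<Vec<u64>> = Vec::new();
    let mut state = SumCountState::default();
    for (row, &v) in values.iter().enumerate() {
        let key: Vec<u64> = key_cols.iter().map(|c| c[row]).collect();
        let next = group_keys.len();
        let idx = *index.entry(key.clone()).or_insert(next);
        if idx == next {
            group_keys.push(key);
            state.push_group();
        }
        state.update(idx, v);
    }
    (group_keys, state)
}
```

   A planner could dispatch to the single-column variant when the GROUP BY has
exactly one column of a supported type and fall back to the generic variant
otherwise; whether that pays off would need to be measured.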


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
