[
https://issues.apache.org/jira/browse/ARROW-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jorge Leitão resolved ARROW-11300.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 9271
[https://github.com/apache/arrow/pull/9271]
> [Rust][DataFusion] Improve hash aggregate performance with large number of
> groups in
> -------------------------------------------------------------------------------------
>
> Key: ARROW-11300
> URL: https://issues.apache.org/jira/browse/ARROW-11300
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Daniël Heres
> Assignee: Daniël Heres
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2021-01-18-13-00-36-685.png
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> Currently, hash aggregates are performing well when having a small number of
> output groups, but the results on db-benchmark
> [https://github.com/h2oai/db-benchmark/pull/182] test on data with a high
> number of output groups.
> [https://github.com/apache/arrow/pull/9234] improved the situation a bit, but
> DataFusion is still much slower than even the slowest result when comparing
> to the published results.
> This seems mostly having to do with the way we use individual key/groups.
> For each new key, we _take_ the indices of the group, resulting in lots of
> small allocations and cache unfriendliness and other overhead if we have many
> keys with only a small (just 1-2) number of rows per group in a batch. Also
> the indices are converted from a Vec to an Array, making the situation worse
> (accounts for ~22% of the instructions on the master branch!), other
> profiling results seem to be from related allocations too.
> To make it efficient for tiny groups, we should probably change the hash
> aggregate algorithm to _take_ based on _all_ indices from the batch in one
> go, and "slice" into the resulting array for the individual accumulators.
>
> Here is some profiling info of the db-benchmark questions 1-5 against master:
> !image-2021-01-18-13-00-36-685.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)