jorgecarleitao opened a new pull request #8172: URL: https://github.com/apache/arrow/pull/8172
This PR is a proposal to fix 4 issues in our aggregations: 1. averages are incorrect 2. they only support aggregations that can be reduced using a single value (reason for issue 1.) 3. they do not leverage arrow’s aggregate kernels nor memory layout 4. they only support a single column The proposal is written here: https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit# Its main features: 1. adds a test of a wrong average and fixes it 1. makes `ScalarValue` a nullable dynamic type, which is closer to how `Array` works (it is nullable) 2. Accumulators now know how to be updated from values (partial) and from other accumulators' state (final) 3. Accumulators can now receive more than one column 4. AggregateExec now knows how to serialize aggregators' state into a ArrayRef's, so that they can be passed throughout the execution 5. Aggregations are now always made in two steps: partial (update from values) and full (update from other's states) 6. MergeExec merges batches in a single batch to reduce batch fragmentation 7. Aggregations leverage arrow's kernels as much as possible (all aggregates + take + concatenate) This PR is built on top of 3 PRs that are under review, and thus is only a draft at this point. The benchmarks are between -30% and +15%. Given that the computation now always requires two passes, I was sufficiently happy with them. More can be achieved later. I am still evaluating the reason for the `aggregate_query_group_by`, but given the functionality that it adds, I considered it sufficiently good for some initial discussions, @andygrove , @nevi-me , @alamb @paddyhoran . The benchmarks were updated to better reflect real data, and the results are as follows: ``` aggregate_query_no_group_by 15 12 time: [478.23 us 479.62 us 480.98 us] change: [-29.686% -27.511% -25.784%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe aggregate_query_group_by 15 12 time: [2.8689 ms 2.8794 ms 2.8922 ms] change: [+12.971% +13.710% +14.445%] (p = 0.00 < 0.05) Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe aggregate_query_group_by_with_filter 15 12 time: [2.1803 ms 2.2062 ms 2.2330 ms] change: [-8.2400% -6.7872% -5.3209%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` Sorry for the long PR, but this was a relatively difficult PR to achieve, as it required refactoring of some of our most delicate components. I will try to split it in smaller parts to each the review. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
