jorgecarleitao opened a new pull request #8172:
URL: https://github.com/apache/arrow/pull/8172


   This PR is a proposal to fix 4 issues in our aggregations:
   
   1. averages are incorrect
   2. they only support aggregations that can be reduced using a single value 
(reason for issue 1.)
   3. they do not leverage arrow’s aggregate kernels nor memory layout
   4. they only support a single column
   
   The proposal is written here: 
https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit#
   
   Its main features:
   
   1. adds a test of a wrong average and fixes it
   1. makes `ScalarValue` a nullable dynamic type, which is closer to how 
`Array` works (it is nullable)
   2. Accumulators now know how to be updated from values (partial) and from 
other accumulators' state (final)
   3. Accumulators can now receive more than one column
   4. AggregateExec now knows how to serialize aggregators' state into a 
ArrayRef's, so that they can be passed throughout the execution
   5. Aggregations are now always made in two steps: partial (update from 
values) and full (update from other's states)
   6. MergeExec merges batches in a single batch to reduce batch fragmentation
   7. Aggregations leverage arrow's kernels as much as possible (all aggregates 
+ take + concatenate)
   
   This PR is built on top of 3 PRs that are under review, and thus is only a 
draft at this point. 
   
   The benchmarks are between -30% and +15%. Given that the computation now 
always requires two passes, I was sufficiently happy with them. More can be 
achieved later.
   
   I am still evaluating the reason for the `aggregate_query_group_by`, but 
given the functionality that it adds, I considered it sufficiently good for 
some initial discussions, @andygrove , @nevi-me , @alamb @paddyhoran .
   
   The benchmarks were updated to better reflect real data, and the results are 
as follows:
   
   ```
   aggregate_query_no_group_by 15 12                                            
                                
                           time:   [478.23 us 479.62 us 480.98 us]
                           change: [-29.686% -27.511% -25.784%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     2 (2.00%) high mild
     5 (5.00%) high severe
   
   aggregate_query_group_by 15 12                                               
                              
                           time:   [2.8689 ms 2.8794 ms 2.8922 ms]
                           change: [+12.971% +13.710% +14.445%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by_with_filter 15 12                                   
                                          
                           time:   [2.1803 ms 2.2062 ms 2.2330 ms]
                           change: [-8.2400% -6.7872% -5.3209%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   ```
   
   Sorry for the long PR, but this was a relatively difficult PR to achieve, as 
it required refactoring of some of our most delicate components. I will try to 
split it in smaller parts to each the review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to