alamb commented on pull request #8172: URL: https://github.com/apache/arrow/pull/8172#issuecomment-691324132
I plan to review this carefully tomorrow morning US eastern time On Fri, Sep 11, 2020 at 5:36 PM Jorge Leitao <[email protected]> wrote: > This PR is a proposal to fix 4 issues in our aggregations: > > 1. averages are incorrect > 2. they only support aggregations that can be reduced using a single > value (reason for issue 1.) > 3. they do not leverage arrow’s aggregate kernels nor memory layout > 4. they only support a single column > > The proposal is written here: > https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit# > > Its main features: > > 1. adds a test of a wrong average and fixes it > 2. makes ScalarValue a nullable dynamic type, which is closer to how > Array works (it is nullable) > 3. Accumulators now know how to be updated from values (partial) and > from other accumulators' state (final) > 4. Accumulators can now receive more than one column > 5. AggregateExec now knows how to serialize aggregators' state into a > ArrayRef's, so that they can be passed throughout the execution > 6. Aggregations are now always made in two steps: partial (update from > values) and full (update from other's states) > 7. MergeExec merges batches in a single batch to reduce batch > fragmentation > 8. Aggregations leverage arrow's kernels as much as possible (all > aggregates + take + concatenate) > > This PR is built on top of 3 PRs that are under review, and thus is only a > draft at this point. > > The benchmarks are between -30% and +15%. Given that the computation now > always requires two passes, I was sufficiently happy with them. More can be > achieved later. > > I am still evaluating the reason for the aggregate_query_group_by, but > given the functionality that it adds, I considered it sufficiently good for > some initial discussions, @andygrove <https://github.com/andygrove> , > @nevi-me <https://github.com/nevi-me> , @alamb <https://github.com/alamb> > @paddyhoran <https://github.com/paddyhoran> . > > The benchmarks were updated to better reflect real data, and the results > are as follows: > > aggregate_query_no_group_by 15 12 > > time: [478.23 us 479.62 us 480.98 us] > > change: [-29.686% -27.511% -25.784%] (p = 0.00 < 0.05) > > Performance has improved. > > Found 7 outliers among 100 measurements (7.00%) > > 2 (2.00%) high mild > > 5 (5.00%) high severe > > > > aggregate_query_group_by 15 12 > > time: [2.8689 ms 2.8794 ms 2.8922 ms] > > change: [+12.971% +13.710% +14.445%] (p = 0.00 < 0.05) > > Performance has regressed. > > Found 5 outliers among 100 measurements (5.00%) > > 3 (3.00%) high mild > > 2 (2.00%) high severe > > > > aggregate_query_group_by_with_filter 15 12 > > time: [2.1803 ms 2.2062 ms 2.2330 ms] > > change: [-8.2400% -6.7872% -5.3209%] (p = 0.00 < 0.05) > > Performance has improved. > > Found 1 outliers among 100 measurements (1.00%) > > 1 (1.00%) high mild > > > Sorry for the long PR, but this was a relatively difficult PR to achieve, > as it required refactoring of some of our most delicate components. I will > try to split it in smaller parts to each the review. > ------------------------------ > You can view, comment on, or merge this pull request online at: > > https://github.com/apache/arrow/pull/8172 > Commit Summary > > - Made aggregates support same signatures as functions. > - Added test of average. > - Added bench. > - Added min/max of StringArray > - Improved performance of take. > - Refactored aggregate expressions to support complex aggregations. > > File Changes > > - *M* rust/arrow/src/compute/kernels/aggregate.rs > <https://github.com/apache/arrow/pull/8172/files#diff-d429b07b78d669cf4b46fbb939b2c2b4> > (69) > - *M* rust/arrow/src/compute/kernels/take.rs > <https://github.com/apache/arrow/pull/8172/files#diff-f8e948e2041b5ec28e1ed52afa033c7a> > (205) > - *M* rust/datafusion/Cargo.toml > <https://github.com/apache/arrow/pull/8172/files#diff-3a31d76ae91bd469dc925ad19fae784e> > (1) > - *M* rust/datafusion/benches/aggregate_query_sql.rs > <https://github.com/apache/arrow/pull/8172/files#diff-d651493b657e5fbc784373f751eeb0ac> > (134) > - *M* rust/datafusion/src/execution/context.rs > <https://github.com/apache/arrow/pull/8172/files#diff-8273e76b6910baa123f3a25a967af3b5> > (59) > - *M* rust/datafusion/src/lib.rs > <https://github.com/apache/arrow/pull/8172/files#diff-86191acf7b8e35ce12a84dbb4c4aa80a> > (1) > - *M* rust/datafusion/src/logical_plan/mod.rs > <https://github.com/apache/arrow/pull/8172/files#diff-9922d86c805a8dc858387eb47612caae> > (185) > - *M* rust/datafusion/src/optimizer/filter_push_down.rs > <https://github.com/apache/arrow/pull/8172/files#diff-4de075a6e9df08cd9c9f37333c5dffb1> > (14) > - *M* rust/datafusion/src/optimizer/projection_push_down.rs > <https://github.com/apache/arrow/pull/8172/files#diff-6d2de6d63c825bb8b553d876e2f8a0db> > (2) > - *M* rust/datafusion/src/optimizer/utils.rs > <https://github.com/apache/arrow/pull/8172/files#diff-541a57a1eedc375627a8b9ef9bb82a99> > (4) > - *A* rust/datafusion/src/physical_plan/aggregates.rs > <https://github.com/apache/arrow/pull/8172/files#diff-fd1edcd337781e9ec6538ef3c4591f3d> > (156) > - *M* rust/datafusion/src/physical_plan/common.rs > <https://github.com/apache/arrow/pull/8172/files#diff-a0bf07ce0c251f5bb6acbaa1981c6344> > (124) > - *M* rust/datafusion/src/physical_plan/expressions.rs > <https://github.com/apache/arrow/pull/8172/files#diff-a98d5d588d3c5b525c6840271a5bdddc> > (1460) > - *M* rust/datafusion/src/physical_plan/filter.rs > <https://github.com/apache/arrow/pull/8172/files#diff-5a5d8d45d9149e76b5f98ba90b32ec2b> > (7) > - *M* rust/datafusion/src/physical_plan/functions.rs > <https://github.com/apache/arrow/pull/8172/files#diff-58faffded5f7be9bf6b08dffbcd9eb8b> > (18) > - *M* rust/datafusion/src/physical_plan/hash_aggregate.rs > <https://github.com/apache/arrow/pull/8172/files#diff-a088cd6cbf2e53a43867b69737c485fe> > (796) > - *M* rust/datafusion/src/physical_plan/mod.rs > <https://github.com/apache/arrow/pull/8172/files#diff-444c9d0d823509bd21c36adcf5a58a59> > (68) > - *M* rust/datafusion/src/physical_plan/planner.rs > <https://github.com/apache/arrow/pull/8172/files#diff-8a7abbf6e4953d964b49b836c832b478> > (86) > - *M* rust/datafusion/src/physical_plan/type_coercion.rs > <https://github.com/apache/arrow/pull/8172/files#diff-282dd66764bc913288d68b8c748ca1c0> > (18) > - *A* rust/datafusion/src/scalar.rs > <https://github.com/apache/arrow/pull/8172/files#diff-d2edb06c8da67de0a7f8ae35e7e7e835> > (232) > - *M* rust/datafusion/src/sql/planner.rs > <https://github.com/apache/arrow/pull/8172/files#diff-21b952259e12c1adf2098d9a797caef1> > (76) > - *M* rust/datafusion/src/test/mod.rs > <https://github.com/apache/arrow/pull/8172/files#diff-fb5577d6be40dc90593d523c89fa0f46> > (16) > - *M* rust/datafusion/src/test/variable.rs > <https://github.com/apache/arrow/pull/8172/files#diff-4ab82593dbaf24ea43bad04292f79a1f> > (6) > - *M* rust/datafusion/src/variable/mod.rs > <https://github.com/apache/arrow/pull/8172/files#diff-a0eb1850b19b4e304a0345696bffed77> > (2) > > Patch Links: > > - https://github.com/apache/arrow/pull/8172.patch > - https://github.com/apache/arrow/pull/8172.diff > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/arrow/pull/8172>, or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AADXZMLPBDZORLFYARREPXDSFKJ45ANCNFSM4RIMHT7Q> > . > ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
