GitHub user sameeragarwal opened a pull request:
https://github.com/apache/spark/pull/12345
[SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using AggregateHashMap
## What changes were proposed in this pull request?
This patch speeds up group-by aggregates by around 3-5x by leveraging an
in-memory `AggregateHashMap` (please see
https://github.com/apache/spark/pull/12161), an append-only aggregate hash map
that acts as a 'cache' for extremely fast key-value lookups while evaluating
aggregates, falling back to the `BytesToBytesMap` if a given key isn't found.
Architecturally, it is backed by a power-of-2-sized array for index lookups
and a columnar batch that stores the key-value pairs. The index lookups in the
array rely on linear probing (with a small maximum number of tries) and use an
inexpensive hash function, which makes them really efficient for the majority
of lookups. However, linear probing combined with an inexpensive hash function
also makes it less robust than the `BytesToBytesMap` (especially for a large
number of keys or for certain key distributions), so we have to fall back on
the latter for correctness.
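
To make the lookup-then-fall-back flow described above concrete, here is a minimal, self-contained sketch. It is not the `AggregateHashMap` from this patch: the class name, the `MAX_PROBES` value, the `long` key/sum columns, and the use of a `java.util.HashMap` as a stand-in for the `BytesToBytesMap` are all assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: NOT Spark's AggregateHashMap. All names are hypothetical.
final class AggregateCacheSketch {
    private static final int MAX_PROBES = 2;  // assumed small bound on linear probing
    private final int[] buckets;              // power-of-2-sized index array; -1 means empty
    private final long[] keys;                // append-only key column
    private final long[] sums;                // append-only value column (running SUM)
    private int numRows = 0;
    // Stand-in for the robust fallback map (BytesToBytesMap in the real patch).
    private final Map<Long, Long> fallback = new HashMap<>();

    AggregateCacheSketch(int numBuckets) {
        if (numBuckets <= 0 || Integer.bitCount(numBuckets) != 1) {
            throw new IllegalArgumentException("numBuckets must be a power of 2");
        }
        buckets = new int[numBuckets];
        Arrays.fill(buckets, -1);
        keys = new long[numBuckets];
        sums = new long[numBuckets];
    }

    // Returns the row index holding `key`, or -1 if probing gives up.
    private int probe(long key, boolean insertIfAbsent) {
        int mask = buckets.length - 1;
        int idx = ((int) key) & mask;         // deliberately cheap "hash"
        for (int step = 0; step < MAX_PROBES; step++) {
            int row = buckets[idx];
            if (row == -1) {                  // empty slot: key is not cached yet
                if (!insertIfAbsent) return -1;
                keys[numRows] = key;          // append a new row to the batch
                sums[numRows] = 0L;
                buckets[idx] = numRows;
                return numRows++;
            }
            if (keys[row] == key) return row; // hit
            idx = (idx + 1) & mask;           // linear probing
        }
        return -1;                            // too many collisions
    }

    // Adds `value` to the running sum for `key`, using the fallback map on a miss.
    void add(long key, long value) {
        int row = probe(key, true);
        if (row >= 0) sums[row] += value;
        else fallback.merge(key, value, Long::sum);
    }

    // Returns the aggregated sum for `key`, or null if the key was never added.
    Long get(long key) {
        int row = probe(key, false);
        return row >= 0 ? Long.valueOf(sums[row]) : fallback.get(key);
    }

    int fallbackSize() { return fallback.size(); }
}
```

In this sketch a key that exhausts its probes always lands in the fallback map, and because occupied buckets are never freed, it keeps landing there, so each key lives in exactly one of the two maps.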
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
codegen = F                              2124 / 2204          9.9         101.3       1.0X
codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sameeragarwal/spark tungsten-aggregate-integration
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12345.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12345
----
commit 7c158bd137f057453d17ef360906e5be90bf5004
Author: Sameer Agarwal <[email protected]>
Date: 2016-03-31T21:15:34Z
[SPARK-14394]
commit ebaea6a87b704afedd47bdd2dd17c92c3ffc6e8e
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-07T00:37:08Z
Integrating AggregateHashMap for Aggregates with Group By
commit cee7e65b3cf7569b4e46941158f164c2130c3981
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-12T17:33:42Z
Add SQLConf
commit 8c9e17a1d40e3014e39b1d04f3a458aa129784f8
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-12T23:01:03Z
20ns
commit 3379294b76d91a55dbe86e31efb9812c8d37768c
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-12T23:18:36Z
generated code
commit 4ee56873764d62efdaf8c47cb74aa399f2194fde
Author: Sameer Agarwal <[email protected]>
Date: 2016-04-13T01:23:27Z
benchmark
----