[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

sameeragarwal Tue, 12 Apr 2016 18:54:07 -0700

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/12345


    [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using AggregateHashMap

    ## What changes were proposed in this pull request?
    
    This patch speeds up group-by aggregates by around 3-5x by leveraging an 
in-memory `AggregateHashMap` (see https://github.com/apache/spark/pull/12161), 
an append-only aggregate hash map that acts as a 'cache' for extremely fast 
key-value lookups while evaluating aggregates. If a given key isn't found in 
this cache, the aggregate falls back to the `BytesToBytesMap`.
    
    Architecturally, it is backed by a power-of-2-sized array for index lookups 
and a columnar batch that stores the key-value pairs. The index lookups in the 
array rely on linear probing (with a small cap on the number of probes) and use 
an inexpensive hash function, which makes them very efficient for the majority 
of lookups. However, linear probing with an inexpensive hash function also makes 
the map less robust than the `BytesToBytesMap` (especially for a large number of 
keys, or for certain key distributions), so we must fall back on the latter for 
correctness.
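
To make the lookup path concrete, here is a minimal sketch of the idea, not the code in this patch. It assumes `long` keys and a single `long` sum per key. The class name `AggregateCacheSketch`, its methods, and the use of a plain `HashMap` as a stand-in for the `BytesToBytesMap` are all hypothetical. Two parallel arrays stand in for the columnar batch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the AggregateHashMap class from the pull request.
class AggregateCacheSketch {
    private final int[] buckets;   // power-of-2-sized index array; -1 marks an empty slot
    private final long[] keys;     // key column of the append-only row store
    private final long[] sums;     // aggregate-buffer column of the row store
    private final int maxSteps;    // cap on linear-probing attempts
    private int numRows = 0;

    // Stand-in for the BytesToBytesMap fallback (assumption: a plain HashMap).
    private final Map<Long, Long> fallback = new HashMap<>();

    AggregateCacheSketch(int capacity, int numBuckets, int maxSteps) {
        if (numBuckets <= 0 || Integer.bitCount(numBuckets) != 1) {
            throw new IllegalArgumentException("numBuckets must be a power of 2");
        }
        this.buckets = new int[numBuckets];
        Arrays.fill(buckets, -1);
        this.keys = new long[capacity];
        this.sums = new long[capacity];
        this.maxSteps = maxSteps;
    }

    // Returns the row index for key, appending a new row if the key is absent.
    // Returns -1 when the probe limit is hit or the row store is full.
    private int findOrInsert(long key) {
        int mask = buckets.length - 1;
        int idx = ((int) key) & mask;              // inexpensive hash: low bits of the key
        for (int step = 0; step < maxSteps; step++) {
            int row = buckets[idx];
            if (row == -1) {                       // empty slot: append a new row
                if (numRows == keys.length) return -1;
                keys[numRows] = key;
                buckets[idx] = numRows;
                return numRows++;
            }
            if (keys[row] == key) return row;      // hit
            idx = (idx + 1) & mask;                // linear probing
        }
        return -1;                                 // give up; caller must fall back
    }

    // Adds value to the running sum for key, using the fallback map when needed.
    void add(long key, long value) {
        int row = findOrInsert(key);
        if (row >= 0) {
            sums[row] += value;
        } else {
            fallback.merge(key, value, Long::sum);
        }
    }

    int fallbackSize() {
        return fallback.size();
    }

    // Combines the rows held in the cache with those held in the fallback map.
    Map<Long, Long> results() {
        Map<Long, Long> out = new HashMap<>(fallback);
        for (int i = 0; i < numRows; i++) {
            out.merge(keys[i], sums[i], Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // 8 buckets and at most 2 probes: keys 0, 8 and 16 all hash to slot 0.
        AggregateCacheSketch m = new AggregateCacheSketch(4, 8, 2);
        long[][] rows = {{0, 1}, {8, 2}, {16, 3}, {16, 4}, {0, 10}};
        for (long[] r : rows) {
            m.add(r[0], r[1]);
        }
        System.out.println(m.results());
        System.out.println("keys in fallback: " + m.fallbackSize());
    }
}
```

In the `main` example, keys 0 and 8 land in the cache, while key 16 exhausts its two probes and is aggregated in the fallback map. Because the cache is append-only, a key that once failed to fit keeps failing, so each key lives in exactly one of the two maps in this sketch.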
    
    ## How was this patch tested?
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
        Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
        Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        codegen = F                              2124 / 2204          9.9         101.3       1.0X
        codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
        codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X
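
The Relative column above appears consistent with dividing the best time of the first row (the baseline) by the best time of each row. The short sketch below reproduces the three reported values from the Best times alone. The class and method names are hypothetical and this is not the benchmark harness itself.

```java
import java.util.Locale;

// Hypothetical helper; assumes Relative = baseline best time / row best time.
class BenchmarkRelativeSketch {
    static String relative(double baselineBestMs, double bestMs) {
        return String.format(Locale.ROOT, "%.1fX", baselineBestMs / bestMs);
    }

    public static void main(String[] args) {
        double baseline = 2124;                        // codegen = F
        System.out.println(relative(baseline, 2124));  // 1.0X
        System.out.println(relative(baseline, 1198));  // 1.8X (codegen = T hashmap = F)
        System.out.println(relative(baseline, 369));   // 5.8X (codegen = T hashmap = T)
    }
}
```

By the same arithmetic, enabling the hash map on top of codegen alone gives 1198 / 369, or roughly 3.2x on the best times.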

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark tungsten-aggregate-integration

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12345
    
----
commit 7c158bd137f057453d17ef360906e5be90bf5004
Author: Sameer Agarwal <[email protected]>
Date:   2016-03-31T21:15:34Z

    [SPARK-14394]

commit ebaea6a87b704afedd47bdd2dd17c92c3ffc6e8e
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-07T00:37:08Z

    Integrating AggregateHashMap for Aggregates with Group By

commit cee7e65b3cf7569b4e46941158f164c2130c3981
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-12T17:33:42Z

    Add SQLConf

commit 8c9e17a1d40e3014e39b1d04f3a458aa129784f8
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-12T23:01:03Z

    20ns

commit 3379294b76d91a55dbe86e31efb9812c8d37768c
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-12T23:18:36Z

    generated code

commit 4ee56873764d62efdaf8c47cb74aa399f2194fde
Author: Sameer Agarwal <[email protected]>
Date:   2016-04-13T01:23:27Z

    benchmark

----


