GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/12490

    [SPARK-14724] Use radix sort for shuffles and sort operator when possible

    ## What changes were proposed in this pull request?
    
    Spark currently uses TimSort for all in-memory sorts, including sorts done 
for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. 
sorting by integer keys). This PR adds a radix sort implementation to the 
unsafe sort package and switches shuffles and sorts to use it when possible.
    
    ## How was this patch tested?
    
    Unit tests, enabling radix sort on existing tests. Microbenchmark results:
    
    ```
    Running benchmark: radix sort 25000000
    OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-35-generic
    Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
    
    radix sort 25000000:                Best/Avg Time(ms)    Rate(M/s)   Per 
Row(ns)   Relative
    
-------------------------------------------------------------------------------------------
    reference Arrays.sort                    1939 / 1960         12.9          
77.6       1.0X
    radix sort one byte                       127 /  159        196.7           
5.1      15.3X
    radix sort two bytes                      239 /  311        104.4           
9.6       8.1X
    radix sort eight bytes                    896 /  905         27.9          
35.8       2.2X
    radix sort key prefix array              1435 / 1503         17.4          
57.4       1.4X
    ```
    
    I also ran a mix of the supported TPCDS queries and compared TimSort vs 
RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. 
From the breakdown below, the actual sort phases were about 20x faster compared 
to TimSort (which suffers from many function call overheads), however sorting 
is only a small fraction of the overall runtime. About half of the TPCDS 
queries were able to take advantage of radix sort (probably doing joins on long 
keys).
    
    ```
    TPCDS on master: 2499s real time, 8185s executor
        - 1171s in TimSort, avg 267 MB/s
    (note the /s accounting is weird here since dataSize counts the record 
sizes too)
    
    TPCDS with radix enabled: 2294s real time, 7391s executor
        - 596s in TimSort, avg 254 MB/s
        - 26s in radix sort, avg 4.2 GB/s
    ```
    
    cc @davies @rxin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark sort-benchmark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12490.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12490
    
----
commit 54da42e1fdfd6b8f1a288fb9015492ae7de5ef03
Author: Eric Liang <[email protected]>
Date:   2016-04-14T01:49:52Z

    wip

commit b71c9ccc3d6c27b799f5526a1813401be23867b8
Author: Eric Liang <[email protected]>
Date:   2016-04-14T03:23:47Z

    Wed Apr 13 20:23:47 PDT 2016

commit ad23ca77275829155e47e25cfc9255acf60f4992
Author: Eric Liang <[email protected]>
Date:   2016-04-14T18:28:10Z

    Thu Apr 14 11:28:10 PDT 2016

commit 52d70f86275ea24d4fa0a9a2a0e2f3ad10ada3db
Author: Eric Liang <[email protected]>
Date:   2016-04-14T21:36:20Z

    try supporting sort op

commit a6fcd2e9eed34a5c8847dfcc962119a00f6d87ca
Author: Eric Liang <[email protected]>
Date:   2016-04-14T23:42:43Z

    Thu Apr 14 16:42:42 PDT 2016

commit 1e55785bcdaa2fd32017aa007bb9f8e9e6902af1
Author: Eric Liang <[email protected]>
Date:   2016-04-15T00:37:04Z

    Thu Apr 14 17:37:04 PDT 2016

commit 02969e3c68bb0792280324fa4ca8dc897f3355fb
Author: Eric Liang <[email protected]>
Date:   2016-04-15T01:06:25Z

    Thu Apr 14 18:06:24 PDT 2016

commit a1451dfb9be3cfac1929a794f1d37bcaea8abc3e
Author: Eric Liang <[email protected]>
Date:   2016-04-15T20:18:04Z

    Fri Apr 15 13:18:04 PDT 2016

commit 28275a4af9bd47de6118b018df4bd49c8ef296a8
Author: Eric Liang <[email protected]>
Date:   2016-04-15T20:18:25Z

    Fri Apr 15 13:18:25 PDT 2016

commit d59e4bfd991f56014a145bc2395d2d0a99420924
Author: Eric Liang <[email protected]>
Date:   2016-04-15T22:02:29Z

    cache blocking

commit 942fa5cf8a7e54451c29997be3a25ad00a753d70
Author: Eric Liang <[email protected]>
Date:   2016-04-15T22:18:36Z

    revert cache blocking (got slower)

commit 1a084284c1ba07cc224a97670df51d6229d32964
Author: Eric Liang <[email protected]>
Date:   2016-04-15T23:03:56Z

    Fri Apr 15 16:03:56 PDT 2016

commit 1a634c216672dee51251f245f13be413d667d607
Author: Eric Liang <[email protected]>
Date:   2016-04-16T00:43:08Z

    Fri Apr 15 17:43:08 PDT 2016

commit 54cfbae97ac00eb49ff4c80dbe2afbc9f0462b86
Author: Eric Liang <[email protected]>
Date:   2016-04-16T20:06:07Z

    Sat Apr 16 13:06:07 PDT 2016

commit 3924ae9862ab73b0b7ac6fadf24d2aff62b4601a
Author: Eric Liang <[email protected]>
Date:   2016-04-16T20:45:17Z

    Sat Apr 16 13:45:17 PDT 2016

commit 0d03792433e1de3a740bdf4c2e0d20a78af4956f
Author: Eric Liang <[email protected]>
Date:   2016-04-16T21:21:05Z

    Sat Apr 16 14:21:05 PDT 2016

commit ac30efe169c0559cb2f754cfc3ca0924f2775e2e
Author: Eric Liang <[email protected]>
Date:   2016-04-16T21:23:44Z

    Sat Apr 16 14:23:44 PDT 2016

commit 9e6c73051caa59ced822955c89ceb176c1e2ea2c
Author: Eric Liang <[email protected]>
Date:   2016-04-16T21:26:26Z

    Sat Apr 16 14:26:26 PDT 2016

commit 8cd0895137e9743651ba5c14e1359e7ed7bb0766
Author: Eric Liang <[email protected]>
Date:   2016-04-16T21:48:25Z

    Sat Apr 16 14:48:25 PDT 2016

commit a43b05a4be48e9d5f60035d3fb2d545ba3f94541
Author: Eric Liang <[email protected]>
Date:   2016-04-16T21:51:21Z

    Sat Apr 16 14:51:21 PDT 2016

commit b17f318ebffbe7a790ec40772b269b2b6cc558f7
Author: Eric Liang <[email protected]>
Date:   2016-04-17T04:19:07Z

    Sat Apr 16 21:19:07 PDT 2016

commit 8e9576a6315387b2eac8e5fb7b69451077e17a87
Author: Eric Liang <[email protected]>
Date:   2016-04-17T19:39:59Z

    Sun Apr 17 12:39:59 PDT 2016

commit d7879ec841381a0196c9630a540ae4041f83c801
Author: Eric Liang <[email protected]>
Date:   2016-04-17T19:57:25Z

    Sun Apr 17 12:57:25 PDT 2016

commit 5f6aaee264117073586440e6a2b50bb991239391
Author: Eric Liang <[email protected]>
Date:   2016-04-17T20:09:54Z

    Sun Apr 17 13:09:54 PDT 2016

commit 8be231238f78f6683caeb1fe21aad12dbc3c4df3
Author: Eric Liang <[email protected]>
Date:   2016-04-17T20:10:45Z

    Sun Apr 17 13:10:45 PDT 2016

commit 003ac95dd9273b4d0680372a875c22c9000c68e8
Author: Eric Liang <[email protected]>
Date:   2016-04-17T20:20:33Z

    Sun Apr 17 13:20:33 PDT 2016

commit c1c30b18d9993007f4e5fa1f90518d77345f5c38
Author: Eric Liang <[email protected]>
Date:   2016-04-17T20:26:26Z

    Sun Apr 17 13:26:26 PDT 2016

commit 364c23f0d5fe64c760701eebc0b8f4f756fbf7ab
Author: Eric Liang <[email protected]>
Date:   2016-04-17T20:26:56Z

    Sun Apr 17 13:26:56 PDT 2016

commit 3cee595dda5bf0d1a0fb870881bf695a22e82c0b
Author: Eric Liang <[email protected]>
Date:   2016-04-18T21:31:31Z

    move to unsafe

commit 31191f39494adf654eaad5df13b9ed76b04886e2
Author: Eric Liang <[email protected]>
Date:   2016-04-18T21:46:59Z

    clean up in mem shuffle path

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to