[GitHub] spark pull request: [SPARK-2304] tera sort example program for shu...

rxin Thu, 26 Jun 2014 22:15:07 -0700

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/1242


    [SPARK-2304] tera sort example program for shuffle benchmarks

    This pull request adds an example program for benchmarking Spark shuffle. 
It dynamically generates a set of 100-byte records according to the tera sort 
spec, and repartitions the data using an evenly spaced range partitioner. By 
design, it does NOT yet perform sorting after the range partitioning.
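
    As a rough illustration of the idea (not the code in this pull request), 
the sketch below generates 100-byte records and assigns each one to a 
partition by splitting the key space into equal-width ranges. It is plain 
Python rather than Spark, the random generator stands in for the Hadoop-derived 
one, and the names generate_records, partition_for and repartition are 
hypothetical. It assumes the tera sort convention that the first 10 bytes of a 
record form the key.

```python
import random

RECORD_LEN = 100  # record size stated in the pull request
KEY_LEN = 10      # assumption: tera sort keys are the first 10 bytes


def generate_records(n, seed=0):
    """Return n pseudo-random 100-byte records (a stand-in generator)."""
    rng = random.Random(seed)
    return [bytes(rng.getrandbits(8) for _ in range(RECORD_LEN))
            for _ in range(n)]


def partition_for(key, num_partitions):
    """Map a key to a partition by cutting the key space into equal ranges."""
    key_space = 1 << (8 * KEY_LEN)
    value = int.from_bytes(key, "big")  # big-endian preserves byte-wise order
    return value * num_partitions // key_space


def repartition(records, num_partitions):
    """Group records by partition; records within a partition stay unsorted."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(record[:KEY_LEN], num_partitions)
        partitions[index].append(record)
    return partitions


if __name__ == "__main__":
    parts = repartition(generate_records(1000), 4)
    print([len(p) for p in parts])
```

    Because the ranges are evenly spaced, no sampling pass over the data is 
needed to pick partition boundaries, and every key in partition i compares 
lower than every key in partition i+1. Sorting within each partition is left 
out here, as in the description above.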
    
    Some of the code (the data generator) is copied directly from Hadoop and 
simplified.
    
    I've used this utility to benchmark Spark at scale, including performing 
100TB of shuffle in 12 mins on 290 nodes.
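
    For scale, the average shuffle throughput implied by those figures can be 
worked out as below. This is a back-of-the-envelope sketch that assumes 
decimal terabytes (10^12 bytes) and an even spread across nodes; the function 
name is hypothetical.

```python
def shuffle_throughput(total_bytes, elapsed_seconds, nodes):
    """Return (aggregate, per-node) average throughput in bytes per second."""
    aggregate = total_bytes / elapsed_seconds
    return aggregate, aggregate / nodes


if __name__ == "__main__":
    # Figures from the description: 100 TB, 12 minutes, 290 nodes.
    aggregate, per_node = shuffle_throughput(100 * 10**12, 12 * 60, 290)
    print(f"{aggregate / 1e9:.1f} GB/s aggregate, "
          f"{per_node / 1e6:.0f} MB/s per node")
```

    Under those assumptions this comes to roughly 139 GB/s in aggregate, or 
a little under 480 MB/s per node.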


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark terasort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1242
    
----
commit adcae69145905162fa3b6932f70be2c932f95f87
Author: Reynold Xin <[email protected]>
Date:   2014-06-04T22:12:28Z

    Added terasort data generator.

commit a4a5789824a7f74b690690a38ab085891b04b823
Author: Reynold Xin <[email protected]>
Date:   2014-06-04T22:55:58Z

    Minor style fix.

commit 62c882fbabc526410d4f0689c22b58b504dbde52
Author: Reynold Xin <[email protected]>
Date:   2014-06-04T23:12:48Z

    Added sorting.

commit b993face65f034907750d99439a613b6646ae5ff
Author: Reynold Xin <[email protected]>
Date:   2014-06-04T23:13:00Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into 
terasort

commit 957efa554dbbd0c0be6629c2f9f0b096feafcefd
Author: Reynold Xin <[email protected]>
Date:   2014-06-05T00:26:20Z

    Make data serializable with Kryo.

commit ec9a9bb2395e8b4da9913f4bd0fd262955909cb9
Author: Reynold Xin <[email protected]>
Date:   2014-06-05T06:08:10Z

    Temporarily removed sorting, and reduced memory usage.

commit 676ca5229a2bb02b8764189622755533fdea7970
Author: Reynold Xin <[email protected]>
Date:   2014-06-27T04:16:03Z

    Merge branch 'master' into terasort
    
    Conflicts:
        core/src/main/scala/org/apache/spark/Partitioner.scala

commit 7bfc7fc73e81ecb9fffb366ce683c5c4742dde7c
Author: Reynold Xin <[email protected]>
Date:   2014-06-27T05:03:06Z

    Fixed header and used input size instead of num tuples.

commit 4efe7c7851afb24eb90e2646a687f316bc6a1e1d
Author: Reynold Xin <[email protected]>
Date:   2014-06-27T05:05:46Z

    Revert RangePartitioner change.

commit f087068ba44dc2046728d4f1f81f1708fb09a540
Author: Reynold Xin <[email protected]>
Date:   2014-06-27T05:11:02Z

    Style cleaned Unsigned16.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
