GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/1242
[SPARK-2304] tera sort example program for shuffle benchmarks
This pull request adds an example program for benchmarking Spark shuffle.
It dynamically generates a set of 100-byte records according to the tera sort
spec, and repartitions the data based on an evenly spaced range partitioner. By
design, it does NOT yet perform sorting after the range partitioning.
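For readers unfamiliar with the approach, here is a rough sketch of the idea in plain Python. It is not the code in this pull request (which is Scala on Spark). It assumes the usual tera sort layout of a 10-byte key followed by a 90-byte payload. The function names are hypothetical, Python's `random` stands in for the spec's data generator, and in-memory lists stand in for the Spark shuffle.

```python
import random

KEY_LEN = 10      # tera sort keys are 10 bytes
RECORD_LEN = 100  # each record: 10-byte key + 90-byte payload


def generate_records(n, seed=0):
    """Yield n pseudo-random 100-byte records.

    Assumption: uses Python's random module, not the tera sort generator.
    """
    rng = random.Random(seed)
    for _ in range(n):
        yield bytes(rng.getrandbits(8) for _ in range(RECORD_LEN))


def partition_for(record, num_partitions):
    """Pick a partition by cutting the key space into equal-width ranges."""
    key = int.from_bytes(record[:KEY_LEN], "big")
    key_space = 1 << (8 * KEY_LEN)  # 2**80 possible keys
    return key * num_partitions // key_space


def repartition(records, num_partitions):
    """Group records by target partition; records are NOT sorted within one."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_for(rec, num_partitions)].append(rec)
    return parts


if __name__ == "__main__":
    parts = repartition(generate_records(1000, seed=1), 4)
    print([len(p) for p in parts])
```

Because the boundaries are evenly spaced rather than sampled from the data, this only balances well when keys are uniformly distributed, which holds for randomly generated benchmark data.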
Some of the code was copied directly from Hadoop and simplified (the data
generator code).
I've used this utility to benchmark Spark at scale, including a 100 TB
shuffle in 12 minutes on 290 nodes.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark terasort
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1242.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1242
----
commit adcae69145905162fa3b6932f70be2c932f95f87
Author: Reynold Xin <[email protected]>
Date: 2014-06-04T22:12:28Z
Added terasort data generator.
commit a4a5789824a7f74b690690a38ab085891b04b823
Author: Reynold Xin <[email protected]>
Date: 2014-06-04T22:55:58Z
Minor style fix.
commit 62c882fbabc526410d4f0689c22b58b504dbde52
Author: Reynold Xin <[email protected]>
Date: 2014-06-04T23:12:48Z
Added sorting.
commit b993face65f034907750d99439a613b6646ae5ff
Author: Reynold Xin <[email protected]>
Date: 2014-06-04T23:13:00Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into
terasort
commit 957efa554dbbd0c0be6629c2f9f0b096feafcefd
Author: Reynold Xin <[email protected]>
Date: 2014-06-05T00:26:20Z
Make data serializable with Kryo.
commit ec9a9bb2395e8b4da9913f4bd0fd262955909cb9
Author: Reynold Xin <[email protected]>
Date: 2014-06-05T06:08:10Z
Temporarily removed sorting, and reduced memory usage.
commit 676ca5229a2bb02b8764189622755533fdea7970
Author: Reynold Xin <[email protected]>
Date: 2014-06-27T04:16:03Z
Merge branch 'master' into terasort
Conflicts:
core/src/main/scala/org/apache/spark/Partitioner.scala
commit 7bfc7fc73e81ecb9fffb366ce683c5c4742dde7c
Author: Reynold Xin <[email protected]>
Date: 2014-06-27T05:03:06Z
Fixed header and used input size instead of num tuples.
commit 4efe7c7851afb24eb90e2646a687f316bc6a1e1d
Author: Reynold Xin <[email protected]>
Date: 2014-06-27T05:05:46Z
Revert RangePartitioner change.
commit f087068ba44dc2046728d4f1f81f1708fb09a540
Author: Reynold Xin <[email protected]>
Date: 2014-06-27T05:11:02Z
Style cleaned Unsigned16.
----