Terasort example

Ewan Higgs Tue, 11 Nov 2014 05:04:58 -0800

Hi all,

I saw that Reynold Xin had a Terasort example PR on Github[1]. It didn't appear to be similar to the Hadoop Terasort example, so I've tried to brush it into shape so it can generate Terasort files (teragen), sort the files (terasort) and validate the files (teravalidate). My branch is available here:


https://github.com/ehiggs/spark/tree/terasort

With this code, you can run the following:

# Generate 1M 100-byte records:
./bin/run-example terasort.TeraGen 100M ~/data/terasort_in

# Sort the file:
MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out


# Validate the file:
MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate


# Validate that an unsorted file is indeed not correctly sorted:
MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad

This matches the interface of the Hadoop version of Terasort, except that I added the ability to use K, M, G, T for record sizes in TeraGen. This code therefore makes a good example of how to use Spark, how to read and write Hadoop files, and also a way to test some of the performance claims of Spark.
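To make the K, M, G, T suffixes concrete, here is a minimal Python sketch of how such a size argument could be parsed. It is not taken from the branch: the names parse_size and record_count are hypothetical, and decimal multipliers are an assumption (the branch may well use powers of 1024). Under that assumption, 100M corresponds to 1M records of 100 bytes, which agrees with the comment on the TeraGen command above.

```python
# Hypothetical helper, not code from the branch.
# Assumption: suffixes are decimal (K = 10**3, M = 10**6, ...).
MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

# Terasort records are 100 bytes each.
RECORD_BYTES = 100


def parse_size(text):
    """Convert a string such as '100M' or '42' into a plain integer."""
    text = text.strip().upper()
    if not text:
        raise ValueError("empty size")
    if text[-1] in MULTIPLIERS:
        # Split the numeric part from the one-letter suffix.
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)


def record_count(size_text):
    """Number of whole 100-byte records that fit in the given size."""
    return parse_size(size_text) // RECORD_BYTES
```

An argument without a suffix is taken as a plain number, so existing Hadoop-style invocations would keep working under this scheme.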

> That's great, but why is this on the mailing list and not submitted as a PR?

I suspect there are some rough edges and I'd really appreciate reviews. I would also like to know if others can try it out on clusters and tell me if it's performing as it should.

For example, I find it runs fine on my local machine, but when I try to sort 100G of data on a cluster of 16 nodes, I get >2900 file splits. This really eats into the sort time.
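As a rough sense of scale for those numbers, the arithmetic below estimates the average split size and the number of splits per node. It assumes 100G means 100 * 10**9 bytes and uses exactly 2900 splits, so the real average split is somewhat smaller and the per-node count somewhat higher.

```python
# Back-of-the-envelope figures; both inputs are assumptions noted above.
total_bytes = 100 * 10**9   # "100G", taken as decimal gigabytes
splits = 2900               # lower bound reported in the message
nodes = 16

avg_split_mb = total_bytes / splits / 10**6
splits_per_node = splits / nodes

print(f"average split: {avg_split_mb:.1f} MB")
print(f"splits per node: {splits_per_node:.0f}")
```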

Another issue is that in TeraValidate, to work around SPARK-1018 I had to clone each element. Does this /really/ need to be done? It's pretty lame.
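For anyone unfamiliar with the problem, SPARK-1018 concerns Hadoop record readers reusing one mutable object for every record, so anything that holds on to elements without copying them ends up with many references to the same value. The Python sketch below is only an analogy, not Spark or TeraValidate code, and reusing_reader is a hypothetical name; it shows why a per-element copy changes the result.

```python
def reusing_reader(records):
    """Yield each record through one shared buffer (object reuse)."""
    buf = bytearray()
    for rec in records:
        buf[:] = rec   # overwrite the shared buffer in place
        yield buf      # every yield hands out the same object


data = [b"cc", b"aa", b"bb"]

# Holding the yielded objects directly keeps three references to one buffer,
# which by the end contains only the last record.
aliased = list(reusing_reader(data))

# Copying each element as it arrives preserves the individual records.
copied = [bytes(b) for b in reusing_reader(data)]
```

Here every entry of aliased refers to the same buffer, while copied holds the three original records; that is the kind of difference the clone in TeraValidate is working around.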

In any event, I know the Spark 1.2 merge window closed on Friday, but as this is only for the examples directory, maybe we can slip it in if we can bash it into shape quickly enough?

Anyway, thanks to everyone on #apache-spark and #scala who helped me get through learning some rudimentary Scala to get this far.


Yours,
Ewan Higgs

[1] https://github.com/apache/spark/pull/1242

