Hi all,
I saw that Reynold Xin had a Terasort example PR on GitHub [1]. It didn't appear to match the Hadoop Terasort example, so I've tried to brush it into shape so that it can generate Terasort files (teragen), sort the files (terasort), and validate the files (teravalidate). My branch is available here:

https://github.com/ehiggs/spark/tree/terasort

With this code, you can run the following:

# Generate 1M 100-byte records (100M of data in total):
./bin/run-example terasort.TeraGen 100M ~/data/terasort_in

# Sort the file:
MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out

# Validate the file:
MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate

# Validate that an unsorted file is indeed not correctly sorted:
MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad

This matches the interface of the Hadoop version of Terasort, except that I added the ability to use K, M, G, and T suffixes for the size argument in TeraGen. The code therefore makes a good example of how to use Spark and how to read and write Hadoop files, and it also offers a way to test some of Spark's performance claims.
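
To make the suffix handling concrete, here is a minimal sketch in Python (chosen for brevity; it is not the Scala code in my branch). The names size_str_to_bytes and record_count are hypothetical, and it assumes the argument is a total size in bytes, which is what the 100M / 1M-record example above suggests. Whether a suffix means a power of 1000 or 1024 is also an assumption, so it is left as a parameter.

```python
RECORD_BYTES = 100  # each Terasort record is 100 bytes


def size_str_to_bytes(size: str, base: int = 1000) -> int:
    """Parse strings such as '100M', '2g', or a plain '500' into bytes.

    base is an assumption: 1000 for decimal units, 1024 for binary units.
    """
    exponents = {"K": 1, "M": 2, "G": 3, "T": 4}
    text = size.strip().upper()
    if not text:
        raise ValueError("empty size string")
    if text[-1] in exponents:
        # Strip the suffix and scale the numeric part.
        return int(text[:-1]) * base ** exponents[text[-1]]
    return int(text)


def record_count(size: str, base: int = 1000) -> int:
    """Number of whole 100-byte records that fit in the requested size."""
    return size_str_to_bytes(size, base) // RECORD_BYTES
```

Under these assumptions, record_count("100M") gives one million records, matching the TeraGen example above.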

> That's great, but why is this on the mailing list and not submitted as a PR?

I suspect there are some rough edges and I'd really appreciate reviews. I would also like to know if others can try it out on clusters and tell me if it's performing as it should.

For example, I find it runs fine on my local machine, but when I try to sort 100G of data on a cluster of 16 nodes, I get >2900 file splits. This really eats into the sort time.
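
For a sense of scale, here is a back-of-envelope calculation in Python. It assumes 100G means 100 * 10^9 bytes and uses exactly 2900 splits (I saw slightly more, so the real average is a little lower). The helper name avg_split_mb is made up for this sketch.

```python
def avg_split_mb(total_bytes: int, splits: int) -> float:
    """Average split size in (decimal) megabytes."""
    return total_bytes / splits / 1e6


total_bytes = 100 * 10**9  # assumption: 100G taken as decimal gigabytes
print(round(avg_split_mb(total_bytes, 2900), 1))  # roughly 34.5 MB per split
```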

Another issue is that in TeraValidate, to work around SPARK-1018 I had to clone each element. Does this /really/ need to be done? It's pretty lame.
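
As far as I understand the problem behind SPARK-1018, the Hadoop record reader reuses the same object for every record, so holding on to records without copying them leaves you with many references to one object. The sketch below is a plain Python analogy of that behaviour, not Spark or Hadoop code; reusing_reader is a hypothetical stand-in for the record reader.

```python
from typing import Iterator, List


def reusing_reader(records: List[bytes]) -> Iterator[bytearray]:
    """Yield every record through one shared, mutable buffer."""
    buf = bytearray()
    for rec in records:
        buf[:] = rec  # overwrite in place: the same object is yielded each time
        yield buf


data = [b"aaa", b"bbb", b"ccc"]

# Without copying, every collected element is the same buffer,
# which ends up holding only the last record.
without_copy = list(reusing_reader(data))

# Copying ("cloning") each element as it is read preserves the values.
with_copy = [bytes(rec) for rec in reusing_reader(data)]
```

Here without_copy ends up as three references to a buffer containing the last record, while with_copy keeps all three records, which is why the clone looks necessary whenever the records are kept around.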

In any event, I know the Spark 1.2 merge window closed on Friday, but as this is only for the examples directory, maybe we can slip it in if we can bash it into shape quickly enough?

Anyway, thanks to everyone on #apache-spark and #scala who helped me get through learning some rudimentary Scala to get this far.

Yours,
Ewan Higgs

[1] https://github.com/apache/spark/pull/1242

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
