Hi all,
I saw that Reynold Xin had a Terasort example PR on GitHub [1]. It didn't
appear to be similar to the Hadoop Terasort example, so I've tried to
brush it into shape so it can generate Terasort files (teragen), sort
the files (terasort) and validate the files (teravalidate). My branch is
available here:
https://github.com/ehiggs/spark/tree/terasort
With this code, you can run the following:
# Generate 1M 100-byte records:
./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
# Sort the file:
MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in \
    ~/data/terasort_out
# Validate the file:
MASTER=local[4] ./bin/run-example terasort.TeraValidate \
    ~/data/terasort_out ~/data/terasort_validate
# Validate that an unsorted file is indeed not correctly sorted:
MASTER=local[4] ./bin/run-example terasort.TeraValidate \
    ~/data/terasort_in ~/data/terasort_validate_bad
This matches the interface for the Hadoop version of Terasort, except I
added the ability to use K,M,G,T for record sizes in TeraGen. This code
therefore makes a good example of how to use Spark, how to read and
write Hadoop files, and also a way to test some of the performance
claims of Spark.
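
To illustrate how K,M,G,T suffixes could be interpreted, here is a minimal
Python sketch (not the Scala code on the branch). It assumes the argument is
a total output size in bytes, that the suffixes are decimal multipliers, and
that every record is 100 bytes, which is what makes 100M come out as 1M
records in the example above. The names parse_size and record_count are
hypothetical.

```python
# Hypothetical sketch: decimal suffixes and 100-byte records are assumptions.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
RECORD_BYTES = 100


def parse_size(text):
    """Convert a string such as '100M' into a number of bytes."""
    text = text.strip().upper()
    if not text:
        raise ValueError("empty size")
    if text[-1] in SUFFIXES:
        # Everything before the suffix must be a plain integer.
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def record_count(text):
    """Number of whole 100-byte records that fit in the requested size."""
    return parse_size(text) // RECORD_BYTES


print(record_count("100M"))  # 1000000
```

If the real code uses binary multipliers (1M = 2^20), the record counts would
be slightly higher than these.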
> That's great, but why is this on the mailing list and not submitted
> as a PR?
I suspect there are some rough edges and I'd really appreciate reviews.
I would also like to know if others can try it out on clusters and tell
me if it's performing as it should.
For example, I find it runs fine on my local machine, but when I try to
sort 100G of data on a cluster of 16 nodes, I get >2900 file splits.
This really eats into the sort time.
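
To put those numbers in perspective, a quick back-of-the-envelope calculation
in Python. It assumes decimal gigabytes and takes 2900 as the split count
(the real figure is a bit higher), so the per-split size is an upper bound.
It says nothing about why the splits come out this small.

```python
# Back-of-the-envelope numbers; decimal gigabytes are an assumption.
total_bytes = 100 * 10**9
splits = 2900
nodes = 16

bytes_per_split = total_bytes / splits
splits_per_node = splits / nodes

print(f"{bytes_per_split / 10**6:.1f} MB per split")  # 34.5 MB per split
print(f"{splits_per_node:.0f} splits per node")       # 181 splits per node
```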
Another issue is that in TeraValidate, to work around SPARK-1018 I had
to clone each element. Does this /really/ need to be done? It's pretty lame.
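
For anyone not familiar with the problem: my understanding (worth checking
against the ticket) is that the Hadoop record reader hands back the same
mutable object for every record, so holding on to records without copying
them leaves you with many references to the last one. The Python sketch below
only simulates that behaviour; ReusingReader is a hypothetical stand-in, not
a Spark or Hadoop class.

```python
# Simulation only: ReusingReader is a hypothetical stand-in, not a real API.
class ReusingReader:
    """Yields every record through one shared, mutable buffer."""

    def __init__(self, records):
        self._records = records
        self._buffer = bytearray()

    def __iter__(self):
        for record in self._records:
            # Overwrite the buffer in place instead of allocating a new one.
            self._buffer[:] = record
            yield self._buffer


data = [b"aaa", b"bbb", b"ccc"]

# Keeping the yielded objects as-is stores three references to one buffer.
without_copy = list(ReusingReader(data))

# Copying each record as it arrives preserves the individual values.
with_copy = [bytes(r) for r in ReusingReader(data)]

print(without_copy)  # every entry shows the last record, ccc
print(with_copy)     # [b'aaa', b'bbb', b'ccc']
```

If that is what is happening, the copy is only needed where records are kept
around (cached, collected, or compared with a neighbour), not where each one
is consumed immediately.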
In any event, I know the Spark 1.2 merge window closed on Friday but as
this is only for the examples directory maybe we can slip it in if we
can bash it into shape quickly enough?
Anyway, thanks to everyone on #apache-spark and #scala who helped me get
through learning some rudimentary Scala to get this far.
Yours,
Ewan Higgs
[1] https://github.com/apache/spark/pull/1242