For now, I’d recommend opening a PR against spark-perf. It would be great to integrate this into the spark-perf harness so that I can run it automatically as part of Spark 1.2.0 release testing. If you open a rough WIP PR over there, I’ll be able to provide feedback to help you get it integrated.
On November 11, 2014 at 12:52:52 PM, Ewan Higgs (ewan.hi...@ugent.be) wrote:

Shall I move the code to spark-perf then and submit a PR? Or shall I submit a PR to spark, where it can remain an idiomatic example, and we can clone it in spark-perf, where it can potentially evolve non-idiomatic optimizations?

Yours,
Ewan

On 11/11/2014 07:58 PM, Reynold Xin wrote:
> This is great. I think the consensus from last time was that we would
> put performance stuff into spark-perf, so it is easy to test different
> Spark versions.
>
> On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
>
>     Hi all,
>     I saw that Reynold Xin had a Terasort example PR on GitHub [1]. It
>     didn't appear to be similar to the Hadoop Terasort example, so I've
>     tried to brush it into shape so it can generate Terasort files
>     (teragen), sort the files (terasort), and validate the files
>     (teravalidate). My branch is available here:
>
>     https://github.com/ehiggs/spark/tree/terasort
>
>     With this code, you can run the following:
>
>     # Generate 1M 100-byte records:
>     ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>
>     # Sort the file:
>     MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out
>
>     # Validate the file:
>     MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate
>
>     # Validate that an unsorted file is indeed not correctly sorted:
>     MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad
>
>     This matches the interface of the Hadoop version of Terasort, except
>     I added the ability to use K, M, G, and T suffixes for record sizes
>     in TeraGen. This code therefore makes a good example of how to use
>     Spark, how to read and write Hadoop files, and also a way to test
>     some of the performance claims of Spark.
>
> That's great, but why is this on the mailing list and not submitted as a PR?
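The K/M/G/T suffix handling Ewan describes could be sketched roughly like this. This is a hypothetical helper, not the actual code from the branch, and the choice of decimal (1000-based) multipliers is my assumption; the branch may well use binary (1024-based) units:

```scala
// Hypothetical sketch of parsing a size argument like TeraGen's "100M".
// Assumption: decimal multipliers (K = 10^3 ... T = 10^12).
object SizeParser {
  private val multipliers = Map(
    'K' -> 1000L,
    'M' -> 1000L * 1000,
    'G' -> 1000L * 1000 * 1000,
    'T' -> 1000L * 1000 * 1000 * 1000
  )

  def parseSize(s: String): Long = {
    val trimmed = s.trim.toUpperCase
    multipliers.get(trimmed.last) match {
      case Some(m) => trimmed.init.toLong * m // e.g. "100M" -> 100 * 10^6
      case None    => trimmed.toLong          // plain number, no suffix
    }
  }
}
```

With 100-byte records, "100M" (100,000,000 bytes) corresponds to the 1M records mentioned in the teragen comment above.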
>     I suspect there are some rough edges and I'd really appreciate
>     reviews. I would also like to know if others can try it out on
>     clusters and tell me if it's performing as it should.
>
>     For example, I find it runs fine on my local machine, but when I
>     try to sort 100G of data on a cluster of 16 nodes, I get >2900
>     file splits. This really eats into the sort time.
>
>     Another issue is that in TeraValidate, to work around SPARK-1018 I
>     had to clone each element. Does this /really/ need to be done?
>     It's pretty lame.
>
>     In any event, I know the Spark 1.2 merge window closed on Friday,
>     but as this is only for the examples directory, maybe we can slip
>     it in if we can bash it into shape quickly enough?
>
>     Anyway, thanks to everyone on #apache-spark and #scala who helped
>     me learn enough rudimentary Scala to get this far.
>
>     Yours,
>     Ewan Higgs
>
>     [1] https://github.com/apache/spark/pull/1242