For now, I’d recommend opening a PR against spark-perf. It would be great to 
integrate this into the spark-perf harness so that I can run it 
automatically as part of Spark 1.2.0 release testing. If you open a rough WIP 
PR over there, I’ll be able to provide feedback to help you get it 
integrated into our benchmarking harness.

On November 11, 2014 at 12:52:52 PM, Ewan Higgs (ewan.hi...@ugent.be) wrote:

Shall I move the code to spark-perf, then, and submit a PR? Or shall I  
submit a PR to Spark, where it can remain an idiomatic example, and clone  
it into spark-perf, where it can evolve non-idiomatic optimizations?  

Yours,  
Ewan  

On 11/11/2014 07:58 PM, Reynold Xin wrote:  
> This is great. I think the consensus from last time was that we would  
> put performance stuff into spark-perf, so it is easy to test different  
> Spark versions.  
>  
>  
> On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:  
>  
> Hi all,  
> I saw that Reynold Xin had a Terasort example PR on Github[1]. It  
> didn't appear to be similar to the Hadoop Terasort example, so  
> I've tried to brush it into shape so it can generate Terasort  
> files (teragen), sort the files (terasort) and validate the files  
> (teravalidate). My branch is available here:  
>  
> https://github.com/ehiggs/spark/tree/terasort  
>  
> With this code, you can run the following:  
>  
> # Generate 1M 100-byte records (100MB of data):  
> ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in  
>  
> # Sort the file:  
> MASTER=local[4] ./bin/run-example terasort.TeraSort  
> ~/data/terasort_in ~/data/terasort_out  
>  
> # Validate the file  
> MASTER=local[4] ./bin/run-example terasort.TeraValidate  
> ~/data/terasort_out ~/data/terasort_validate  
>  
> # Validate that an unsorted file is indeed not correctly sorted:  
> MASTER=local[4] ./bin/run-example terasort.TeraValidate  
> ~/data/terasort_in ~/data/terasort_validate_bad  
>  
> This matches the interface of the Hadoop version of TeraSort,  
> except that I added the ability to use K, M, G, and T suffixes for  
> the total data size in TeraGen. This code therefore makes a good  
> example of how to use Spark and how to read and write Hadoop files,  
> and it also gives us a way to test some of Spark's performance claims.  
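>  
> For reference, the suffix handling is roughly along these lines. This  
> is a simplified sketch rather than the exact code from the branch, and  
> it assumes decimal multipliers and the fixed 100-byte record size:  
>  
> def sizeToRecords(size: String): Long = {  
>   // Each record is 100 bytes, as in Hadoop TeraGen.  
>   val recordSize = 100L  
>   val multipliers = Map('K' -> 1000L, 'M' -> 1000L * 1000,  
>     'G' -> 1000L * 1000 * 1000, 'T' -> 1000L * 1000 * 1000 * 1000)  
>   // e.g. "100M" -> 100 * 10^6 bytes -> 1M records  
>   val bytes = multipliers.get(size.last.toUpper) match {  
>     case Some(m) => size.init.toLong * m  
>     case None => size.toLong  
>   }  
>   bytes / recordSize  
> }  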
>  
> > That's great, but why is this on the mailing list and not  
> submitted as a PR?  
>  
> I suspect there are some rough edges and I'd really appreciate  
> reviews. I would also like to know if others can try it out on  
> clusters and tell me if it's performing as it should.  
>  
> For example, it runs fine on my local machine, but when I try to  
> sort 100G of data on a cluster of 16 nodes, the input is broken into  
> more than 2900 file splits, which really eats into the sort time.  
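>  
> (A possible stopgap, though it shouldn't be necessary, would be to  
> coalesce the splits after the read. A rough sketch, assuming an  
> existing SparkContext sc and that TeraInputFormat from my branch  
> yields Text key/value pairs:)  
>  
> import org.apache.hadoop.io.Text  
> val input = sc.newAPIHadoopFile[Text, Text, TeraInputFormat]("terasort_in")  
> // Collapse the ~2900 input splits into roughly one partition per  
> // core (4 cores x 16 nodes here); shuffle = false just merges splits  
> // locally instead of adding an extra shuffle before the sort.  
> val fewer = input.coalesce(4 * 16, shuffle = false)  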
>  
> Another issue is that in TeraValidate I had to clone each element  
> to work around SPARK-1018. Does this /really/ need to be done?  
> It's pretty lame.  
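>  
> (The clone looks roughly like this. A minimal sketch, assuming Text  
> key/value pairs and that input is the RDD read via TeraInputFormat:)  
>  
> import org.apache.hadoop.io.Text  
> // Hadoop RecordReaders reuse the same Writable instances across  
> // records, so each element has to be deep-copied before it is  
> // buffered (see SPARK-1018); Text's copy constructor does the copy.  
> val copied = input.map { case (k, v) => (new Text(k), new Text(v)) }  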
>  
> In any event, I know the Spark 1.2 merge window closed on Friday,  
> but since this is only for the examples directory, maybe we can slip  
> it in if we can bash it into shape quickly enough?  
>  
> Anyway, thanks to everyone on #apache-spark and #scala who helped  
> me get through learning some rudimentary Scala to get this far.  
>  
> Yours,  
> Ewan Higgs  
>  
> [1] https://github.com/apache/spark/pull/1242  