I had not resolved it in time for 0.9, but IIRC a recent PR fixed bugs in spill [1]. Are you able to reproduce this with Spark master?
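For reference, toggling the setting for a side-by-side run could look like this (a sketch assuming the 0.9-style SparkConf API; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spill-repro")           // placeholder name
  .set("spark.shuffle.spill", "false") // flip to "true" for the comparison run
val sc = new SparkContext(conf)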
Regards,
Mridul

[1] https://github.com/apache/incubator-spark/pull/533

On Wed, Feb 19, 2014 at 9:58 AM, Andrew Ash <and...@andrewash.com> wrote:
> I also confirmed that the spill to disk _was_ occurring:
>
> 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> map of 634 MB to disk (1 time so far)
> 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> map of 581 MB to disk (1 time so far)
>
> On Tue, Feb 18, 2014 at 8:07 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi dev list,
>>
>> I'm running into an issue where I see different results from Spark when
>> I run with spark.shuffle.spill=false vs. leaving it at the default (true).
>>
>> It's on internal data so I can't share my exact repro, but here's roughly
>> what I'm doing:
>>
>> val rdd = sc.textFile(...)
>>   .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>>   .distinct
>>   .join(
>>     sc.textFile(...)
>>       .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>>       .distinct
>>   )
>>   .map { case (k, (v1, v2)) => Seq(v1, k, v2).mkString("|") }
>>
>> Then I output:
>>
>> (rdd.count, rdd.distinct.count)
>>
>> When I run with spark.shuffle.spill=false I get this:
>>
>> (3192729,3192729)
>>
>> And with spark.shuffle.spill=true I get this:
>>
>> (3192931,3192726)
>>
>> Has anyone else seen bugs in join-heavy operations while using
>> spark.shuffle.spill=true?
>>
>> My current theory is that I have a hashCode collision between rows
>> (unusual, I know) and that AppendOnlyMap checks key equality with
>> hashCode() plus equals(), while ExternalAppendOnlyMap checks it with
>> hashCode() alone.
>>
>> I'd appreciate some additional eyes on this problem.
>>
>> Right now I'm looking through the source and tests for AppendOnlyMap and
>> ExternalAppendOnlyMap to see if anything jumps out at me.
>>
>> Thanks!
>> Andrew
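A quick self-contained sketch of the collision theory above: "Aa" and "BB" are a well-known pair of distinct JVM strings with the same hashCode, so a map that resolves key equality by hashCode alone would conflate them, while one that also consults equals() keeps them apart. (The object name is arbitrary; this illustrates the theory, not Spark's internals.)

// "Aa" and "BB" both hash to 2112 under java.lang.String.hashCode,
// but are not equal strings.
object HashCollisionSketch extends App {
  val a = "Aa"
  val b = "BB"
  println(a.hashCode == b.hashCode) // true
  println(a == b)                   // false

  val pairs = Seq(a -> 1, b -> 1)
  // Grouping by hashCode alone merges the two distinct keys...
  println(pairs.groupBy(_._1.hashCode).size) // 1
  // ...while grouping by the key itself (hashCode + equals) keeps both.
  println(pairs.groupBy(_._1).size) // 2
}

If the external map really did merge keys on hash alone, counts flowing through distinct and join would drift from the in-memory path, which would be consistent with the mismatched numbers above.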