I also confirmed that the spill to disk _was_ occurring:

14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 634 MB to disk (1 time so far)
14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 581 MB to disk (1 time so far)
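For anyone following along, a minimal sketch of one way to toggle the setting (assuming the Spark 0.9-era SparkConf API; the app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative driver setup: disable shuffle spilling so aggregation
    // stays entirely in the in-memory AppendOnlyMap. The default is true,
    // which falls back to ExternalAppendOnlyMap under memory pressure.
    val conf = new SparkConf()
      .setAppName("spill-repro") // hypothetical app name
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)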
On Tue, Feb 18, 2014 at 8:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> Hi dev list,
>
> I'm running into an issue where I'm seeing different results from Spark
> when I run with spark.shuffle.spill=false vs leaving it at the default
> (true).
>
> It's on internal data so I can't share my exact repro, but here's roughly
> what I'm doing:
>
> val rdd = sc.textFile(...)
>   .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>   .distinct
>   .join(
>     sc.textFile(...)
>       .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>       .distinct
>   )
>   .map { case (k, (v1, v2)) => Seq(v1, k, v2).mkString("|") }
>
> Then I output:
>
> (rdd.count, rdd.distinct.count)
>
> When I run with spark.shuffle.spill=false I get this:
> (3192729,3192729)
>
> And with spark.shuffle.spill=true I get this:
> (3192931,3192726)
>
> Has anyone else seen any bugs in join-heavy operations while using
> spark.shuffle.spill=true?
>
> My current theory is that I have a hashCode collision between rows
> (unusual, I know) and that AppendOnlyMap checks key equality using
> hashCode() plus equals(), while ExternalAppendOnlyMap checks equality
> using hashCode() alone.
>
> Would appreciate some additional eyes on this problem for sure.
>
> Right now I'm looking through the source and tests for AppendOnlyMap and
> ExternalAppendOnlyMap to see if anything jumps out at me.
>
> Thanks!
> Andrew
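To make the collision theory concrete, here is a standalone sketch (plain Scala, not Spark's actual map code) of how keys that collide on hashCode() get conflated if a map compares hash codes alone. "Aa" and "BB" are a classic JVM String collision; both hash to 2112:

    // "Aa" and "BB" collide: both String.hashCode() values are 2112.
    val k1 = "Aa"
    val k2 = "BB"
    println(k1.hashCode == k2.hashCode) // true
    println(k1 == k2)                   // false -- distinct keys

    // Correct semantics (hashCode narrows, equals decides) keep both keys:
    println(Map(k1 -> 1, k2 -> 2).size) // 2

    // Keying on the hash code alone silently merges the two entries,
    // the kind of miscount described above:
    println(Map(k1.hashCode -> 1, k2.hashCode -> 2).size) // 1

If something like that were happening in the spill path, distinct keys could be merged (under-counting) or one key's values split across groups (over-counting), which would line up with the counts differing in both directions above.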