Looks like you have a large number of distinct keys. As you suspect, this may be due to hash collisions, since there are only about 4 billion possible hash codes. It could be related to this PR: https://github.com/apache/incubator-spark/pull/612.
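For reference, String hash codes in Java/Scala are 32-bit Ints, which is where that ~4 billion limit comes from. A minimal sketch (not from this thread) of a classic collision pair, just to illustrate how two distinct keys can share a hash code:

    object HashCollisionDemo {
      def main(args: Array[String]): Unit = {
        // "Aa" and "BB" are a well-known String.hashCode collision pair:
        // 'A'*31 + 'a' = 65*31 + 97 = 2112, and 'B'*31 + 'B' = 66*31 + 66 = 2112.
        val a = "Aa"
        val b = "BB"
        println(a.hashCode == b.hashCode) // true  -- same 32-bit hash code
        println(a == b)                   // false -- but still distinct keys
      }
    }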
The other thing is we had some issues with the behavior of arbitrary serialization/compression engines, and this is solved in the PR that Mridul referenced. What compression and serialization libraries are you using?

2014-02-18 20:56 GMT-08:00 Mridul Muralidharan <mri...@gmail.com>:

> I had not resolved it in time for 0.9 - but IIRC there was a recent PR
> which fixed bugs in spill [1] : are you able to reproduce this with
> spark master ?
>
> Regards,
> Mridul
>
> [1] https://github.com/apache/incubator-spark/pull/533
>
> On Wed, Feb 19, 2014 at 9:58 AM, Andrew Ash <and...@andrewash.com> wrote:
> > I confirmed also that the spill to disk _was_ occurring:
> >
> > 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> > map of 634 MB to disk (1 time so far)
> > 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> > map of 581 MB to disk (1 time so far)
> >
> >
> > On Tue, Feb 18, 2014 at 8:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> >
> >> Hi dev list,
> >>
> >> I'm running into an issue where I'm seeing different results from Spark
> >> when I run with spark.shuffle.spill=false vs leaving it at the default
> >> (true).
> >>
> >> It's on internal data so I can't share my exact repro, but here's roughly
> >> what I'm doing:
> >>
> >> val rdd = sc.textFile(...)
> >>   .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
> >>   .distinct
> >>   .join(
> >>     sc.textFile(...)
> >>       .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
> >>       .distinct
> >>   )
> >>   .map{ case (k,(v1,v2)) => Seq(v1,k,v2).mkString("|") }
> >>
> >> Then I output:
> >> (rdd.count, rdd.distinct.count)
> >>
> >> When I run with spark.shuffle.spill=false I get this:
> >> (3192729,3192729)
> >>
> >> And with spark.shuffle.spill=true I get this:
> >> (3192931,3192726)
> >>
> >> Has anyone else seen any bugs in join-heavy operations while using
> >> spark.shuffle.spill=true?
> >>
> >> My current theory is that I have a hashcode collision between rows
> >> (unusual I know) and that the AppendOnlyMap does equality based on
> >> hashcode()+equals() and ExternalAppendOnlyMap does equality based just on
> >> hashcode().
> >>
> >> Would appreciate some additional eyes on this problem for sure.
> >>
> >> Right now I'm looking through the source and tests for AppendOnlyMap and
> >> ExternalAppendOnlyMap to see if anything jumps out at me.
> >>
> >> Thanks!
> >> Andrew
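One way to sanity-check the collision theory in the quoted message would be to count join keys that actually share a hash code. A rough sketch, e.g. for spark-shell, assuming the parsed join keys are available as an RDD[String] named `keys` (a name introduced here for illustration, not part of the original repro):

    // Hypothetical check, not part of the original repro: group distinct keys by
    // their hashCode and count hash codes shared by more than one distinct key.
    val collisions = keys
      .distinct()
      .map(k => (k.hashCode, k))
      .groupByKey()
      .filter { case (_, ks) => ks.size > 1 }

    println(s"hash codes shared by multiple distinct keys: ${collisions.count()}")

A nonzero count would at least confirm that colliding keys exist in the data, which is a precondition for the AppendOnlyMap vs. ExternalAppendOnlyMap equality theory above.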