I also confirmed that the spill to disk _was_ occurring:

14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 634 MB to disk (1 time so far)
14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 581 MB to disk (1 time so far)
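
For anyone trying this locally, one way to flip the setting between the two
runs (a sketch with placeholder names, not my actual job setup):

import org.apache.spark.{SparkConf, SparkContext}

// toggle "spark.shuffle.spill" between "true" and "false" to compare runs
val conf = new SparkConf()
  .setAppName("shuffle-spill-repro")
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)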


On Tue, Feb 18, 2014 at 8:07 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi dev list,
>
> I'm running into an issue where I'm seeing different results from Spark
> when I run with spark.shuffle.spill=false vs leaving it at the default
> (true).
>
> It's on internal data so I can't share my exact repro, but here's roughly
> what I'm doing:
>
> val rdd = sc.textFile(...)
>   .map(l => ... (col1, col2))  // parse CSV into Tuple2[String,String]
>   .distinct
>   .join(
>     sc.textFile(...)
>        .map(l => ... (col1, col2))  // parse CSV into Tuple2[String,String]
>        .distinct
>   )
>   .map{ case (k,(v1,v2)) => Seq(v1,k,v2).mkString("|") }
>
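>
> In case the elided parsing matters, here's a self-contained version of
> the same shape, with a hypothetical two-column CSV parse and placeholder
> paths standing in for my real ones (the actual data and parser are
> internal):
>
> import org.apache.spark.SparkContext._  // implicit pair-RDD functions for join
>
> // hypothetical stand-in for the real parser: take the first two CSV columns
> def parse(line: String): (String, String) = {
>   val cols = line.split(",", -1)
>   (cols(0), cols(1))
> }
>
> val left  = sc.textFile("left.csv").map(parse).distinct
> val right = sc.textFile("right.csv").map(parse).distinct
> val rdd = left.join(right)
>   .map { case (k, (v1, v2)) => Seq(v1, k, v2).mkString("|") }
>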
> Then I output:
> (rdd.count, rdd.distinct.count)
>
> When I run with spark.shuffle.spill=false I get this:
> (3192729,3192729)
>
> And with spark.shuffle.spill=true I get this:
> (3192931,3192726)
>
> Has anyone else seen any bugs in join-heavy operations while using
> spark.shuffle.spill=true?
>
> My current theory is that I have a hashCode() collision between rows
> (unusual, I know) and that AppendOnlyMap determines key equality using
> hashCode() plus equals(), while ExternalAppendOnlyMap determines it based
> on hashCode() alone.
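>
> For illustration, here's the kind of collision I mean, using hypothetical
> keys rather than anything from my actual data: "Aa" and "BB" are distinct
> strings that share a hashCode, so a map comparing keys by hash alone
> would merge them.
>
> val keys = Seq("Aa", "BB")          // "Aa".hashCode == "BB".hashCode == 2112
> keys.map(_.hashCode).distinct.size  // 1 -- hash-only equality conflates the rows
> keys.distinct.size                  // 2 -- hashCode + equals keeps them separate
>
> If the external map really merged on hashCode alone while spilling, two
> such rows would collapse into one, which would line up with the slightly
> lower distinct count I see with spill=true.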
>
> I'd appreciate some additional eyes on this problem.
>
> Right now I'm looking through the source and tests for AppendOnlyMap and
> ExternalAppendOnlyMap to see if anything jumps out at me.
>
> Thanks!
> Andrew
>
