I had not resolved it in time for 0.9, but IIRC a recent PR fixed bugs in spill [1]. Are you able to reproduce this with Spark master?
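For reference, toggling the setting for a side-by-side run could look like this (a sketch assuming the 0.9-style SparkConf API; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spill-repro")           // placeholder name
  .set("spark.shuffle.spill", "false") // flip to "true" for the comparison run
val sc = new SparkContext(conf)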
Regards,
Mridul

[1] https://github.com/apache/incubator-spark/pull/533

On Wed, Feb 19, 2014 at 9:58 AM, Andrew Ash <and...@andrewash.com> wrote:
> I also confirmed that the spill to disk _was_ occurring:
>
> 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> map of 634 MB to disk (1 time so far)
> 14/02/18 22:50:50 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
> map of 581 MB to disk (1 time so far)
>
> On Tue, Feb 18, 2014 at 8:07 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi dev list,
>>
>> I'm running into an issue where I see different results from Spark when
>> I run with spark.shuffle.spill=false vs. leaving it at the default (true).
>>
>> It's on internal data so I can't share my exact repro, but here's roughly
>> what I'm doing:
>>
>> val rdd = sc.textFile(...)
>>   .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>>   .distinct
>>   .join(
>>     sc.textFile(...)
>>       .map(l => ... (col1, col2)) // parse CSV into Tuple2[String,String]
>>       .distinct
>>   )
>>   .map { case (k, (v1, v2)) => Seq(v1, k, v2).mkString("|") }
>>
>> Then I output:
>>
>> (rdd.count, rdd.distinct.count)
>>
>> When I run with spark.shuffle.spill=false I get this:
>>
>> (3192729,3192729)
>>
>> And with spark.shuffle.spill=true I get this:
>>
>> (3192931,3192726)
>>
>> Has anyone else seen bugs in join-heavy operations while using
>> spark.shuffle.spill=true?
>>
>> My current theory is that I have a hashCode collision between rows
>> (unusual, I know) and that AppendOnlyMap checks key equality with
>> hashCode() plus equals(), while ExternalAppendOnlyMap checks it with
>> hashCode() alone.
>>
>> I'd appreciate some additional eyes on this problem.
>>
>> Right now I'm looking through the source and tests for AppendOnlyMap and
>> ExternalAppendOnlyMap to see if anything jumps out at me.
>>
>> Thanks!
>> Andrew
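A quick self-contained sketch of the collision theory above: "Aa" and "BB" are a well-known pair of distinct JVM strings with the same hashCode, so a map that resolves key equality by hashCode alone would conflate them, while one that also consults equals() keeps them apart. (The object name is arbitrary; this illustrates the theory, not Spark's internals.)

// "Aa" and "BB" both hash to 2112 under java.lang.String.hashCode,
// but are not equal strings.
object HashCollisionSketch extends App {
  val a = "Aa"
  val b = "BB"
  println(a.hashCode == b.hashCode) // true
  println(a == b)                   // false

  val pairs = Seq(a -> 1, b -> 1)
  // Grouping by hashCode alone merges the two distinct keys...
  println(pairs.groupBy(_._1.hashCode).size) // 1
  // ...while grouping by the key itself (hashCode + equals) keeps both.
  println(pairs.groupBy(_._1).size) // 2
}

If the external map really did merge keys on hash alone, counts flowing through distinct and join would drift from the in-memory path, which would be consistent with the mismatched numbers above.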