(BTW, this had a bug with negative hash codes in 1.1.0, so you should try
branch-1.1 for it.)

Matei

> On Nov 3, 2014, at 6:28 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> 
> In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
> better performance while creating fewer files. So I'd suggest trying that too.
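> 
> For anyone who wants to try it, here is a minimal sketch (the master URL and
> app name are placeholders):
> 
>   import org.apache.spark.{SparkConf, SparkContext}
> 
>   val conf = new SparkConf()
>     .setMaster("local[4]")                  // placeholder master URL
>     .setAppName("sort-shuffle-test")        // placeholder app name
>     .set("spark.shuffle.manager", "sort")   // switch to the sort-based shuffle
>   val sc = new SparkContext(conf)
> 
> or equivalently on the command line: spark-submit --conf spark.shuffle.manager=sort ...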
> 
> Matei
> 
>> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>> 
>> Hey Matt,
>> 
>> There's some prior work that compares consolidation performance on a
>> medium-scale workload:
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>> 
>> There we noticed about a 2x performance degradation in the reduce phase on
>> ext3. I am not aware of any other concrete numbers. Maybe others have more
>> experience to add.
>> 
>> -Andrew
>> 
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>> 
>>> Hi everyone,
>>> 
>>> I'm running into more and more cases where too many files are opened when
>>> spark.shuffle.consolidateFiles is turned off.
>>> 
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on by
>>> default. From the documentation, it seems like performance could suffer on
>>> ext3 file systems. However, what concrete performance degradation is
>>> typically seen? A 2x slowdown in the average job? 3x? Also, what causes
>>> the performance degradation on ext3 file systems specifically?
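>>> 
>>> For reference, turning the setting on looks roughly like this (a minimal
>>> sketch; the app name is a placeholder):
>>> 
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>> 
>>>   val conf = new SparkConf()
>>>     .setAppName("consolidation-test")                // placeholder app name
>>>     .set("spark.shuffle.consolidateFiles", "true")   // map tasks on the same core reuse shuffle files
>>>   val sc = new SparkContext(conf)
>>> 
>>> or via spark-submit: --conf spark.shuffle.consolidateFiles=true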
>>> 
>>> Thanks,
>>> 
>>> -Matt Cheah
>>> 
>>> 
>>> 
> 


