Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei,

Thanks for responding.

For some more context, we were running into "Too many open files" issues
where we were seeing this happen immediately after the Collect phase
(about 30 seconds into a run) on a decently sized dataset (roughly 14 million rows).
The ulimit set in the spark-env was 256,000, which we believe should have
been enough, but even with it set that high we were still seeing issues.
Can you comment on what a "good" ulimit should be in these cases?
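For anyone reproducing this: the limit that matters is the per-process one the executor actually inherits, not just what's written in spark-env, so it's worth checking on a worker directly. A quick sanity check (the PID line is a placeholder you'd fill in):

```shell
# Soft and hard open-file limits for the current shell/session
ulimit -Sn
ulimit -Hn

# To count descriptors actually held by a running executor, substitute its PID:
# ls /proc/<executor-pid>/fd | wc -l
```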

We believe what might have caused this is that some process got orphaned
without cleaning up its open file handles.
However, other than anecdotal evidence and some speculation, we don't have
much evidence to expand on this further.

We were wondering if we could get some more information about how many
files get opened during a shuffle.
We discussed that it is going to be around N x M, where N is the number of
map tasks and M is the number of reducers.
Does this sound about right?
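To make that estimate concrete, here is a back-of-the-envelope sketch. The C x R figure for consolidation (files pooled per executor core instead of per map task) reflects our understanding of the pre-sort hash shuffle, not an authoritative formula:

```python
def hash_shuffle_files(map_tasks, reducers, cores=1, consolidate=False):
    """Rough count of shuffle files for Spark's hash-based shuffle.

    Without consolidation, each map task writes one file per reducer.
    With spark.shuffle.consolidateFiles=true, files are reused across
    map tasks on the same core, so the count scales with cores instead.
    """
    return (cores if consolidate else map_tasks) * reducers

# e.g. 1,000 map tasks feeding 500 reducers on 16-core executors
print(hash_shuffle_files(1000, 500))                              # 500000
print(hash_shuffle_files(1000, 500, cores=16, consolidate=True))  # 8000
```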


Are there any other considerations we should be aware of when setting
consolidateFiles to True?
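For the archive, this is the exact switch we're toggling, as we'd set it in spark-defaults.conf (Spark 1.x property name; the value shown is just what we've been experimenting with):

```
# Enables consolidation in the hash-based shuffle
# (not used by the sort-based shuffle manager)
spark.shuffle.consolidateFiles  true
```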

Thanks, 
Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com  | 650.226.6338



On 11/3/14, 6:28 PM, "Matei Zaharia"  wrote:

>In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will
>have better performance while creating fewer files. So I'd suggest trying
>that too.
>
>Matei
>
>> On Nov 3, 2014, at 6:12 PM, Andrew Or  wrote:
>> 
>> Hey Matt,
>> 
>> There's some prior work that compares consolidation performance on some
>> medium-scale workload:
>> 
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>> 
>> There we noticed about 2x performance degradation in the reduce phase on
>> ext3. I am not aware of any other concrete numbers. Maybe others have more
>> experiences to add.
>> 
>> -Andrew
>> 
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah :
>> 
>>> Hi everyone,
>>> 
>>> I'm running into more and more cases where too many files are opened when
>>> spark.shuffle.consolidateFiles is turned off.
>>> 
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on by
>>> default. From the documentation, it seems like performance could be hurt
>>> on ext3 file systems. However, what are the concrete numbers of
>>> performance degradation typically seen? A 2x slowdown in the average
>>> job? 3x? Also, what causes the performance degradation on ext3 file
>>> systems specifically?
>>> 
>>> Thanks,
>>> 
>>> -Matt Cheah
>>> 
>>> 
>>> 
>


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0, so you should try
branch-1.1 for it).

Matei

> On Nov 3, 2014, at 6:28 PM, Matei Zaharia  wrote:
> 
> In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
> better performance while creating fewer files. So I'd suggest trying that too.
> 
> Matei
> 
>> On Nov 3, 2014, at 6:12 PM, Andrew Or  wrote:
>> 
>> Hey Matt,
>> 
>> There's some prior work that compares consolidation performance on some
>> medium-scale workload:
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>> 
>> There we noticed about 2x performance degradation in the reduce phase on
>> ext3. I am not aware of any other concrete numbers. Maybe others have more
>> experiences to add.
>> 
>> -Andrew
>> 
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah :
>> 
>>> Hi everyone,
>>> 
>>> I'm running into more and more cases where too many files are opened when
>>> spark.shuffle.consolidateFiles is turned off.
>>> 
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on by
>>> default. From the documentation, it seems like performance could be hurt
>>> on ext3 file systems. However, what are the concrete numbers of
>>> performance degradation typically seen? A 2x slowdown in the average
>>> job? 3x? Also, what causes the performance degradation on ext3 file
>>> systems specifically?
>>> 
>>> Thanks,
>>> 
>>> -Matt Cheah
>>> 
>>> 
>>> 
> 





Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
better performance while creating fewer files. So I'd suggest trying that too.
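To spell that out, switching managers is a one-line configuration change (property name as of Spark 1.1), e.g. in spark-defaults.conf:

```
spark.shuffle.manager  sort
```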

Matei

> On Nov 3, 2014, at 6:12 PM, Andrew Or  wrote:
> 
> Hey Matt,
> 
> There's some prior work that compares consolidation performance on some
> medium-scale workload:
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
> 
> There we noticed about 2x performance degradation in the reduce phase on
> ext3. I am not aware of any other concrete numbers. Maybe others have more
> experiences to add.
> 
> -Andrew
> 
> 2014-11-03 17:26 GMT-08:00 Matt Cheah :
> 
>> Hi everyone,
>> 
>> I'm running into more and more cases where too many files are opened when
>> spark.shuffle.consolidateFiles is turned off.
>> 
>> I was wondering if this is a common scenario among the rest of the
>> community, and if so, whether it is worth turning the setting on by
>> default. From the documentation, it seems like performance could be hurt
>> on ext3 file systems. However, what are the concrete numbers of
>> performance degradation typically seen? A 2x slowdown in the average
>> job? 3x? Also, what causes the performance degradation on ext3 file
>> systems specifically?
>> 
>> Thanks,
>> 
>> -Matt Cheah
>> 
>> 
>> 





Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Andrew Or
Hey Matt,

There's some prior work that compares consolidation performance on a
medium-scale workload:
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

There we noticed about 2x performance degradation in the reduce phase on
ext3. I am not aware of any other concrete numbers. Maybe others have more
experiences to add.

-Andrew

2014-11-03 17:26 GMT-08:00 Matt Cheah :

> Hi everyone,
>
> I'm running into more and more cases where too many files are opened when
> spark.shuffle.consolidateFiles is turned off.
>
> I was wondering if this is a common scenario among the rest of the
> community, and if so, whether it is worth turning the setting on by
> default. From the documentation, it seems like performance could be hurt
> on ext3 file systems. However, what are the concrete numbers of
> performance degradation typically seen? A 2x slowdown in the average
> job? 3x? Also, what causes the performance degradation on ext3 file
> systems specifically?
>
> Thanks,
>
> -Matt Cheah
>
>
>


Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone,

I'm running into more and more cases where too many files are opened when
spark.shuffle.consolidateFiles is turned off.

I was wondering if this is a common scenario among the rest of the
community, and if so, whether it is worth turning the setting on by
default. From the documentation, it seems like performance could be hurt
on ext3 file systems. However, what are the concrete numbers of
performance degradation typically seen? A 2x slowdown in the average job?
3x? Also, what causes the performance degradation on ext3 file systems
specifically?

Thanks,

-Matt Cheah





