(BTW, this had a bug with negative hash codes in 1.1.0, so you should try branch-1.1 for it.)

Matei
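For context, here is a minimal sketch of what switching to the sort-based shuffle discussed below looks like in application code. The config key spark.shuffle.manager is from the Spark 1.1 configuration docs; the app name is just a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In Spark 1.1, spark.shuffle.manager defaults to "hash". Setting it to
// "sort" makes each map task write a single sorted output file instead of
// one file per reducer, which keeps the open-file count down.
val conf = new SparkConf()
  .setAppName("SortShuffleExample") // placeholder name
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)
```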
> On Nov 3, 2014, at 6:28 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
> In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
> better performance while creating fewer files. So I'd suggest trying that too.
>
> Matei
>
>> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>>
>> Hey Matt,
>>
>> There's some prior work that compares consolidation performance on a
>> medium-scale workload:
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>>
>> There we noticed about a 2x performance degradation in the reduce phase on
>> ext3. I am not aware of any other concrete numbers. Maybe others have more
>> experience to add.
>>
>> -Andrew
>>
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>>
>>> Hi everyone,
>>>
>>> I'm running into more and more cases where too many files are opened when
>>> spark.shuffle.consolidateFiles is turned off.
>>>
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it would be worth turning the setting on by
>>> default. From the documentation, it seems like performance could be hurt
>>> on ext3 file systems. However, what are the concrete numbers for the
>>> performance degradation typically seen? A 2x slowdown in the average job?
>>> 3x? Also, what causes the performance degradation on ext3 file systems
>>> specifically?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
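For anyone following along, here is a minimal sketch of turning on the consolidation setting Matt asks about, assuming a Spark 1.1-era hash-based shuffle setup (the app name is illustrative; spark.shuffle.consolidateFiles is a documented Spark 1.x config key that defaults to false).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// With consolidation off, the hash-based shuffle creates roughly
// (map tasks) x (reduce tasks) files. With it on, shuffle files are reused
// across map tasks running on the same core, cutting the count to roughly
// (cores) x (reduce tasks) -- at the cost of slower reduce-side reads on
// some file systems such as ext3, as discussed in the thread above.
val conf = new SparkConf()
  .setAppName("ConsolidateFilesExample") // placeholder name
  .set("spark.shuffle.consolidateFiles", "true")

val sc = new SparkContext(conf)
```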