If you have to pick a number, it's better to overestimate than underestimate, since task launching in Spark is relatively cheap compared to spilling to disk or OOMing (now much less likely due to Tungsten). Eventually we plan to make this dynamic, but for now you should tune it for your particular workload.
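For example, here's a minimal sketch of lowering the setting for a small job (assuming a Spark 1.x SQLContext named sqlContext; the value 8 is just an illustration, pick one based on your data size):

    // Lower the shuffle partition count for a job with small input.
    // Assumes an existing SQLContext named sqlContext.
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    // Or set it at submit time instead:
    //   spark-submit --conf spark.sql.shuffle.partitions=8 ...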
On Tue, Mar 1, 2016 at 3:19 PM, Teng Liao <tl...@palantir.com> wrote:
> Hi,
>
> I was wondering what the rationale behind defaulting all repartitioning to
> spark.sql.shuffle.partitions is. I'm seeing a huge overhead when running a
> job whose input partitions is 2 and, using the default value for
> spark.sql.shuffle.partitions, this is now 200. Thanks.
>
> -Teng Fei Liao