If you have to pick a number, it's better to overestimate than underestimate, since task launching in Spark is relatively cheap compared to spilling to disk or OOMing (now much less likely due to Tungsten). Eventually we plan to make this dynamic, but for now you should tune it for your particular workload.
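For example, here's a minimal sketch of lowering the setting for a small job (assuming a Spark 1.x SQLContext named sqlContext; the value 8 is just an illustration, pick one based on your data size):

    // Lower the shuffle partition count for a job with small input.
    // Assumes an existing SQLContext named sqlContext.
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    // Or set it at submit time instead:
    //   spark-submit --conf spark.sql.shuffle.partitions=8 ...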
On Tue, Mar 1, 2016 at 3:19 PM, Teng Liao <tl...@palantir.com> wrote:
> Hi,
>
> I was wondering what the rationale behind defaulting all repartitioning to
> spark.sql.shuffle.partitions is. I'm seeing a huge overhead when running a
> job whose input partitions is 2 and, using the default value for
> spark.sql.shuffle.partitions, this is now 200. Thanks.
>
> -Teng Fei Liao