Aaron Davidson created SPARK-2203:
-------------------------------------
Summary: PySpark does not infer default numPartitions in the same way
as Spark
Key: SPARK-2203
URL: https://issues.apache.org/jira/browse/SPARK-2203
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark
always assumes that the default parallelism for the reduce side is
ctx.defaultParallelism, a constant typically determined by the number of cores
in the cluster.
In contrast, Spark's Partitioner#defaultPartitioner uses the same number of
reduce partitions as map partitions unless the defaultParallelism config is
explicitly set. This tends to be a better default for avoiding OOMs, and
PySpark should adopt the same behavior.
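The difference between the two inference strategies can be sketched in plain
Python (this is an illustrative paraphrase, not actual Spark source; the
function names and signatures are invented for the example):

```python
def pyspark_style_default(default_parallelism, upstream_partition_counts):
    # PySpark (as reported in this issue): always fall back to
    # ctx.defaultParallelism, ignoring the upstream RDD's partitioning.
    return default_parallelism

def scala_style_default(configured_parallelism, upstream_partition_counts):
    # Spark's Partitioner#defaultPartitioner (paraphrased): use the
    # explicitly configured parallelism if set; otherwise reuse the
    # largest upstream (map-side) partition count.
    if configured_parallelism is not None:
        return configured_parallelism
    return max(upstream_partition_counts)

# Example: a 500-partition input on an 8-core cluster, no explicit config.
print(pyspark_style_default(8, [500]))    # -> 8
print(scala_style_default(None, [500]))   # -> 500
```

With a large map-side input, the PySpark rule funnels the shuffle into a
handful of reduce partitions, which is exactly the OOM risk described above.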
--
This message was sent by Atlassian JIRA
(v6.2#6252)