GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/1138
SPARK-2203: PySpark defaults to use same num reduce partitions as map side
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(),
PySpark always assumes that the default parallelism for the reduce side is
ctx.defaultParallelism, a constant typically determined by the number of
cores in the cluster.
In contrast, Spark's Partitioner#defaultPartitioner uses the same number of
reduce partitions as map partitions unless the defaultParallelism config is
explicitly set. This tends to be a better default for avoiding OOMs, and
should also be the behavior of PySpark.
JIRA: https://issues.apache.org/jira/browse/SPARK-2203
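For illustration, a minimal sketch of the behavior difference (the local[4]
master, the partition counts, and the app name are illustrative assumptions,
and RDD.getNumPartitions() is assumed to be available in this PySpark
version):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-default-demo")

    # Map side: an RDD with 100 partitions.
    pairs = sc.parallelize(range(1000), 100).map(lambda x: (x % 10, x))

    # groupByKey() with no explicit numPartitions: PySpark previously fell
    # back to sc.defaultParallelism (4 here, one per local core). With this
    # change, the reduce side inherits the map side's partition count (100),
    # matching Scala's Partitioner#defaultPartitioner.
    grouped = pairs.groupByKey()
    print(grouped.getNumPartitions())  # 100 with this change; 4 before

    sc.stop()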
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark pyfix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1138.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1138
----
commit 1bd5751fad08b0b2c69f9a0816b6b20fa06621fe
Author: Aaron Davidson <[email protected]>
Date: 2014-06-19T19:43:50Z
SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(),
PySpark always assumes that the default parallelism for the reduce side is
ctx.defaultParallelism, a constant typically determined by the number of
cores in the cluster. In contrast, Spark's Partitioner#defaultPartitioner
uses the same number of reduce partitions as map partitions unless the
defaultParallelism config is explicitly set. This tends to be a better
default for avoiding OOMs, and should also be the behavior of PySpark.
JIRA: https://issues.apache.org/jira/browse/SPARK-2203
----