Aaron Davidson created SPARK-2203:
-------------------------------------

             Summary: PySpark does not infer default numPartitions in same way as Spark
                 Key: SPARK-2203
                 URL: https://issues.apache.org/jira/browse/SPARK-2203
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.0.0
            Reporter: Aaron Davidson
            Assignee: Aaron Davidson


For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark 
always assumes that the default parallelism to use for the reduce side is 
ctx.defaultParallelism, a constant typically determined by the number of cores 
in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner uses the same number of 
reduce partitions as map partitions unless the spark.default.parallelism config 
is explicitly set. This tends to be a better default for avoiding OOMs, and 
should also be the behavior of PySpark.
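The contrast can be sketched as follows. This is a minimal illustration, not the actual Spark source; the function names are invented for this example, and the logic is a simplification of Partitioner#defaultPartitioner as described above:

```python
def scala_default_num_partitions(upstream_partition_counts, default_parallelism=None):
    """Sketch of Partitioner#defaultPartitioner: use the explicitly
    configured spark.default.parallelism if set, otherwise the largest
    partition count among the upstream (map-side) RDDs."""
    if default_parallelism is not None:
        return default_parallelism
    return max(upstream_partition_counts)


def pyspark_default_num_partitions(upstream_partition_counts, ctx_default_parallelism):
    """Sketch of PySpark's current behavior: always ctx.defaultParallelism,
    regardless of the map side's partition count."""
    return ctx_default_parallelism


# A 500-partition map side on an 8-core cluster, with no explicit config:
print(scala_default_num_partitions([500]))                                # 500
print(pyspark_default_num_partitions([500], ctx_default_parallelism=8))   # 8
```

With a large map side and a small cluster, the Scala default keeps the reduce side at 500 partitions, while PySpark shrinks it to 8, concentrating the shuffle data and making OOMs more likely.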



--
This message was sent by Atlassian JIRA
(v6.2#6252)
