[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985934#comment-13985934 ]
Diana Carroll commented on SPARK-823:
-------------------------------------
Okay, this is definitely more than a documentation bug, because PySpark and
Scala behave differently when spark.default.parallelism isn't set by the user.
I'm testing with wordcount.
PySpark: reduceByKey will use the value of sc.defaultParallelism. That value
is set to the number of threads when running locally. On my Spark Standalone
"cluster", which has a single node with a single core, the value is 2. If I set
spark.default.parallelism, sc.defaultParallelism is set to that value and
reduceByKey uses it.
Scala: reduceByKey will use the number of partitions from my file/map stage and
ignore the value of sc.defaultParallelism. sc.defaultParallelism is set by the
same logic as in PySpark (number of threads for local, 2 for my microcluster); it
is just ignored.
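For reference, here is a minimal Scala sketch of the wordcount test I'm describing (the app name and input path are placeholders, and the master is assumed to come from the environment or spark-submit); the analogous check in PySpark reports sc.defaultParallelism instead of the parent RDD's partition count:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicit conversion for reduceByKey (pre-1.3)

object WordCountParallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-parallelism"))

    // Number of threads in local mode; 2 on my one-node, one-core standalone "cluster".
    println(sc.defaultParallelism)

    val counts = sc.textFile("input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Scala: this matches the parent (map-side) RDD's partition count and ignores
    // sc.defaultParallelism. PySpark's reduceByKey uses sc.defaultParallelism here.
    println(counts.partitions.size)

    sc.stop()
  }
}
{code}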
I'm not sure which approach is correct. Scala works as described here:
http://spark.apache.org/docs/latest/tuning.html
{quote}
Spark automatically sets the number of “map” tasks to run on each file
according to its size (though you can control it through optional parameters to
SparkContext.textFile, etc), and for distributed “reduce” operations, such as
groupByKey and reduceByKey, it uses the largest parent RDD’s number of
partitions. You can pass the level of parallelism as a second argument (see the
spark.PairRDDFunctions documentation), or set the config property
spark.default.parallelism to change the default. In general, we recommend 2-3
tasks per CPU core in your cluster.
{quote}
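For reference, the two overrides that paragraph describes look like this in Scala (a spark-shell-style sketch; the input path and the value 4 are just placeholders):
{code}
import org.apache.spark.SparkConf

// Per operation: pass the level of parallelism as the second argument to the
// shuffle operation; this takes precedence over any default.
val pairs = sc.textFile("input.txt").flatMap(_.split("\\s+")).map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _, 4)

// Globally: set the property on the SparkConf used to create the context.
val conf = new SparkConf()
  .setAppName("wordcount")
  .set("spark.default.parallelism", "4")
{code}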
> spark.default.parallelism's default is inconsistent across scheduler backends
> -----------------------------------------------------------------------------
>
> Key: SPARK-823
> URL: https://issues.apache.org/jira/browse/SPARK-823
> Project: Spark
> Issue Type: Bug
> Components: Documentation, Spark Core
> Affects Versions: 0.8.0, 0.7.3
> Reporter: Josh Rosen
> Priority: Minor
>
> The [0.7.3 configuration
> guide|http://spark-project.org/docs/latest/configuration.html] says that
> {{spark.default.parallelism}}'s default is 8, but the default is actually
> max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos
> scheduler, and {{threads}} for the local scheduler:
> https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
> https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
> https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
> Should this be clarified in the documentation? Should the Mesos scheduler
> backend's default be revised?