Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/20091#discussion_r162778187
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -43,17 +43,19 @@ object Partitioner {
   /**
    * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
    *
-   * If any of the RDDs already has a partitioner, and the number of partitions of the
-   * partitioner is either greater than or is less than and within a single order of
-   * magnitude of the max number of upstream partitions, choose that one.
+   * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
+   * as the default partitions number, otherwise we'll use the max number of upstream partitions.
    *
-   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
-   * spark.default.parallelism is set, then we'll use the value from SparkContext
-   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
+   * If any of the RDDs already has a partitioner, and the partitioner is an eligible one (with a
+   * partitions number that is not less than the max number of upstream partitions by an order of
+   * magnitude), or the number of partitions is larger than the default one, we'll choose the
+   * exsiting partitioner.
--- End diff --
We should rephrase this for clarity.
How about
"When available, we choose the partitioner from the RDD with the maximum number of
partitions. If this partitioner is eligible (its number of partitions is within an
order of magnitude of the maximum number of partitions across the RDDs), or it has
more partitions than the default partitions number, we use this partitioner."
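
The rule being described could be sketched roughly as below. This is an illustrative simplification, not the actual `Partitioner.defaultPartitioner` code; `isEligible` and `choosePartitions` are hypothetical names, and the eligibility check assumes the "single order of magnitude" rule is implemented with base-10 logarithms.

```scala
// Hypothetical sketch of the partitioner-selection rule under discussion.
// Not Spark's actual implementation; names are illustrative only.
object PartitionerChoiceSketch {
  // An existing partitioner is "eligible" if its partition count is within a
  // single order of magnitude of the max number of upstream partitions.
  def isEligible(partitions: Int, maxUpstreamPartitions: Int): Boolean =
    math.log10(maxUpstreamPartitions) - math.log10(partitions) < 1

  // Reuse the existing partitioner's partition count when it is eligible or
  // at least as large as the default; otherwise fall back to the default
  // (spark.default.parallelism if set, else the max upstream partitions).
  def choosePartitions(existing: Option[Int],
                       maxUpstream: Int,
                       default: Int): Int =
    existing match {
      case Some(p) if isEligible(p, maxUpstream) || p >= default => p
      case _ => default
    }
}
```

For example, with 100 upstream partitions an existing 11-partition partitioner is eligible (within one order of magnitude), while a 5-partition one is not and the default would be used instead.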
---