Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/20091#discussion_r162552121
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -67,31 +69,32 @@ object Partitioner {
       None
     }
-    if (isEligiblePartitioner(hasMaxPartitioner, rdds)) {
+    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
+      rdd.context.defaultParallelism
+    } else {
+      rdds.map(_.partitions.length).max
+    }
+
+    // If the existing max partitioner is an eligible one, or its partitions number is larger
+    // than the default number of partitions, use the existing partitioner.
+    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
+        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
--- End diff --
> It depends on how you define "default".

I don't see an ambiguity here - am I missing something?
To rephrase my point - this proposed PR has an impact only if the user has explicitly set 'spark.default.parallelism'; otherwise it is a noop.
What is the concern here? That users have set incorrect values for spark.default.parallelism?
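To make the noop claim concrete, here is a minimal sketch over plain Ints (the names and the order-of-magnitude paraphrase of isEligiblePartitioner are mine, not the actual Partitioner.scala code):

```scala
import scala.math.log10

// Paraphrase of #20002's eligibility rule: the existing partitioner is eligible only if its
// partition count is within one order of magnitude of the largest upstream partition count.
def isEligible(existingPartitions: Int, maxUpstreamPartitions: Int): Boolean =
  log10(maxUpstreamPartitions) - log10(existingPartitions) < 1

// The proposed condition from the diff above, with the config lookup replaced by an Option.
def reuseExisting(
    existingPartitions: Int,
    maxUpstreamPartitions: Int,
    explicitParallelism: Option[Int]): Boolean = {
  val defaultNumPartitions = explicitParallelism.getOrElse(maxUpstreamPartitions)
  isEligible(existingPartitions, maxUpstreamPartitions) ||
    defaultNumPartitions < existingPartitions
}

// The RDD carrying the max partitioner is itself one of the rdds, so existingPartitions can
// never exceed maxUpstreamPartitions. Hence, with spark.default.parallelism unset,
// defaultNumPartitions == maxUpstreamPartitions and the added clause is always false:
assert(reuseExisting(10, 1000, None) == isEligible(10, 1000))  // unset: same as #20002 alone
assert(reuseExisting(10, 1000, Some(8)))                       // explicit 8 < 10: reuse
```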
> If we want to respect spark.default.parallelism strictly, we should not reuse partitioner at all.

I agree with you - we should not have - except that ship sailed a long time back. This has been the behavior in Spark since at least 0.5; I don't have context before that.
Historically, default parallelism was added later - "largest partitioner if set, else largest partition count when no partitioner is set" was the original behavior. When default parallelism was introduced, the existing behavior was continued, probably (I guess) for backward compatibility.
#20002 surgically fixed only the case where the inferred partition count is off by at least an order of magnitude.
When it is off by an order of magnitude *and* the user has explicitly specified spark.default.parallelism, rely on the user-provided value; else preserve the existing behavior.
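To illustrate that rule with made-up numbers (nothing below comes from the PR; a quick script-style sketch): suppose the existing partitioner has 20 partitions while the largest upstream RDD has 1000, i.e. it is off by more than an order of magnitude, so #20002 alone would ignore it.

```scala
import scala.math.log10

val existingPartitions = 20
val maxUpstreamPartitions = 1000
// #20002's eligibility check (paraphrased): false here, since 20 vs 1000 is > 1 order off.
val eligible = log10(maxUpstreamPartitions) - log10(existingPartitions) < 1

for (explicitParallelism <- Seq(None, Some(48), Some(8))) {
  val defaultNumPartitions = explicitParallelism.getOrElse(maxUpstreamPartitions)
  val reuse = eligible || defaultNumPartitions < existingPartitions
  // The else branch is assumed to build HashPartitioner(defaultNumPartitions), per the rest
  // of defaultPartitioner (not shown in the quoted hunk above).
  val outcome =
    if (reuse) s"reuse existing partitioner ($existingPartitions partitions)"
    else s"new HashPartitioner($defaultNumPartitions)"
  println(s"spark.default.parallelism=$explicitParallelism -> $outcome")
}
```

With these numbers: leaving spark.default.parallelism unset keeps the current behavior (HashPartitioner(1000)), an explicit 48 is respected (HashPartitioner(48)), and an explicit 8 lets the existing 20-partition partitioner win (8 < 20).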