Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20091#discussion_r162552121
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -67,31 +69,32 @@ object Partitioner {
           None
         }
     
    -    if (isEligiblePartitioner(hasMaxPartitioner, rdds)) {
    +    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    +      rdd.context.defaultParallelism
    +    } else {
    +      rdds.map(_.partitions.length).max
    +    }
    +
    +    // If the existing max partitioner is an eligible one, or its partitions number is larger
    +    // than the default number of partitions, use the existing partitioner.
    +    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
    +        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
    --- End diff ---
    
    > It depends on how you define "default".
    
    I don't see any ambiguity here - am I missing something?
    To rephrase my point - this proposed PR has an impact only if the user has explicitly set 'spark.default.parallelism' - otherwise it is a noop.
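
    To make that concrete, with the names from the diff above (and assuming a partitioned RDD's partition count equals its partitioner's numPartitions), the new disjunct cannot fire when the config is unset - a sketch of the reasoning, not the committed code:

    ```scala
    // With spark.default.parallelism unset, the default falls back to the
    // largest upstream partition count:
    val defaultNumPartitions = rdds.map(_.partitions.length).max

    // hasMaxPartitioner.get.getNumPartitions <= defaultNumPartitions here, so
    // the second disjunct is always false and the condition reduces to the
    // pre-PR isEligiblePartitioner check alone - i.e. a noop.
    val useExistingPartitioner = hasMaxPartitioner.nonEmpty &&
      (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)
    ```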
    
    What is the concern here? That users have set incorrect values for spark.default.parallelism?
    
    > If we want to respect spark.default.parallelism strictly, we should not reuse partitioner at all.
    
    I agree with you - we should not have - except that ship sailed a long time back; since at least 0.5 this has been the behavior in Spark - I don't have context before that.
    Historically, default parallelism was added later - "largest partitioner if set, or largest partition count when no partitioner is set" was the behavior. When default parallelism was introduced, the behavior was continued, probably (I guess) for backward compatibility.
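
    For reference, that historical rule amounts to something like the following (a paraphrase from memory, with a hypothetical name - not the historical source):

    ```scala
    import org.apache.spark.{HashPartitioner, Partitioner}
    import org.apache.spark.rdd.RDD

    // Hypothetical paraphrase of the pre-default-parallelism rule: reuse the
    // largest existing partitioner if any upstream RDD has one, otherwise
    // hash-partition by the largest upstream partition count.
    def legacyDefaultPartitioner(rdds: Seq[RDD[_]]): Partitioner = {
      val existing = rdds.flatMap(_.partitioner)
      if (existing.nonEmpty) existing.maxBy(_.numPartitions)
      else new HashPartitioner(rdds.map(_.partitions.length).max)
    }
    ```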
    
    
    #20002 surgically fixed only the case where the inferred partition count is off by at least an order of magnitude.
    When it is off by an order of magnitude *and* the user has explicitly specified spark.default.parallelism, rely on the user-provided value - else preserve the existing behavior.
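
    For context, my reading of the order-of-magnitude check from #20002 is a log10 comparison along these lines (a sketch under that reading, not necessarily the committed code):

    ```scala
    import org.apache.spark.rdd.RDD
    import scala.math.log10

    // The existing max partitioner is "eligible" when its partition count is
    // within one order of magnitude of the largest upstream partition count.
    def isEligiblePartitioner(hasMaxPartitioner: RDD[_], rdds: Seq[RDD[_]]): Boolean = {
      val maxPartitions = rdds.map(_.partitions.length).max
      log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
    }
    ```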

