sunchao commented on a change in pull request #32875:
URL: https://github.com/apache/spark/pull/32875#discussion_r748487755



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -171,6 +148,27 @@ trait Partitioning {
     required.requiredNumPartitions.forall(_ == numPartitions) && satisfies0(required)
   }
 
+  /**
+   * Creates a shuffle spec for this partitioning and its required distribution. The
+   * spec is used in the scenario where an operator has multiple children (e.g., join), and is
+   * used to decide whether this child is co-partitioned with others, therefore whether extra
+   * shuffle shall be introduced.
+   *
+   * @param defaultNumPartitions the default number of partitions to use when creating a new

Review comment:
       (sorry for the slightly late reply, I was on PTO)
   
   This parameter is for partitionings that need to be shuffled themselves in 
the incompatible case, even when their number of partitions is greater than 
that of the other side. For example, suppose we have the following query:
   ```sql
   SELECT * FROM A JOIN B ON A.c1 = B.d1 AND A.c2 = B.d2
   ```
   and `A` has partitioning `HashPartitioning(10)` and `B` has partitioning 
`RangePartitioning(20)` (keys omitted here for brevity). Suppose the value of 
`spark.sql.shuffle.partitions` is 5. Then ideally we only want to shuffle the 
`B` side: we don't want to use 20 as `B`'s number of partitions, since that 
would make us pick `B` and end up re-shuffling both `A` and `B`.
   
   On the other hand, if `spark.sql.shuffle.partitions` is 15 or 25, then we 
want to shuffle both sides. 
   
   Here, `defaultNumPartitions` is used by `RangePartitioning` in place of its 
own number of partitions, and that value is then used to decide which side 
should be shuffled. It will also be used by `DataSourcePartitioning` for the 
same reason.
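
   To make the example above concrete, here is a rough sketch of the decision 
as a toy model. It is not the actual code in this PR: `SideSpec`, `HashSide`, 
`RangeSide` and `ShuffleChoice.sidesToShuffle` are hypothetical names, and the 
rule "pick the side reporting the most partitions as the reference" is a 
simplifying assumption based on the description above.
   ```scala
   // Hypothetical, simplified model for illustration; not Spark's actual API.
   sealed trait SideSpec {
     // Partition count this side reports when the reference side is chosen.
     def numPartitions: Int
     // Whether this side can stay as-is while the other side is shuffled to match it.
     def canBeReference: Boolean
   }

   // A hash-partitioned child reports its own partition count.
   final case class HashSide(numPartitions: Int) extends SideSpec {
     val canBeReference: Boolean = true
   }

   // A range-partitioned child has to be shuffled in the incompatible case,
   // so it reports the default count instead of its current one.
   final case class RangeSide(currentNumPartitions: Int, defaultNumPartitions: Int) extends SideSpec {
     val numPartitions: Int = defaultNumPartitions
     val canBeReference: Boolean = false
   }

   object ShuffleChoice {
     // Returns, for each side in order, whether that side needs a shuffle.
     def sidesToShuffle(specs: Seq[SideSpec]): Seq[Boolean] = {
       // Assumed rule: the side reporting the most partitions is the reference.
       val bestIdx = specs.indices.maxBy(i => specs(i).numPartitions)
       val best = specs(bestIdx)
       specs.indices.map { i =>
         val s = specs(i)
         if (i == bestIdx) !s.canBeReference
         else !(s.canBeReference && best.canBeReference && s.numPartitions == best.numPartitions)
       }
     }
   }
   ```
   In this model, `sidesToShuffle(Seq(HashSide(10), RangeSide(20, 5)))` marks 
only the `B` side for shuffling, while a default of 15 or 25 marks both sides. 
Letting the range side report its current 20 partitions would also mark both 
sides, which is the case the parameter is meant to avoid.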

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


