sunchao commented on a change in pull request #32875:
URL: https://github.com/apache/spark/pull/32875#discussion_r748487755
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -171,6 +148,27 @@ trait Partitioning {
required.requiredNumPartitions.forall(_ == numPartitions) &&
satisfies0(required)
}
+ /**
+ * Creates a shuffle spec for this partitioning and its required
distribution. The
+ * spec is used in the scenario where an operator has multiple children
(e.g., join), and is
+ * used to decide whether this child is co-partitioned with others,
therefore whether extra
+ * shuffle shall be introduced.
+ *
+ * @param defaultNumPartitions the default number of partitions to use when
creating a new
Review comment:
(sorry for the slight late reply, was on PTO)
This parameter is for the partitionings that require a shuffle on itself
even though its number of partitions is greater than the other side, in the
incompatible case. For example, suppose we have the following query:
```sql
SELECT * FROM A JOIN B ON A.c1 = B.d1 AND A.c2 = B.d2
```
and `A` has partitioning `HashPartitioning(10)` and `B` has partitioning
`RangePartitioning(20)` (omitted keys here for brevity). Suppose the value of
`spark.sql.shuffle.partitions` is 5, then ideally we only want to shuffle the
`B` side: we don't want to use 20 as `B`'s number of partitions which will let
us pick `B` and end up re-shuffle both `A` and `B`.
On the other hand, if `spark.sql.shuffle.partitions` is 15 or 25, then we
want to shuffle both sides.
Here, the `defaultNumPartitions` is used by `RangePartitioning` which
subsequently is used to decide which side should be shuffled. It will also be
used by `DataSourcePartitioning` for the same reason.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]