maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL]
improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382666207
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
##########
@@ -34,6 +34,29 @@ import
org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
import org.apache.spark.sql.execution.joins.SortMergeJoinExec
import org.apache.spark.sql.internal.SQLConf
+/**
+ * A rule to optimize skewed joins to avoid one or a little tasks processing
most of the data.
+ *
+ * The general idea is to treat a Sort-Merge join as many sub-joins that each
sub-join processes
+ * the data of a left-side partition and a right-side partition. For each
sub-join, split the skewed
+ * partition into sub-partitions and do a cartesian product of sub-partitions
from left and
+ * right sides.
+ *
+ * For example, assume the Sort-Merge join has 4 partitions:
+ * left: [L1, L2, L3, L4]
+ * right: [R1, R2, R3, R4]
+ *
+ * Let's say L2, L4 and R3, R4 are skewed, and each of them get split into 2
sub-partitions. This
+ * has 4 sub-joins at the beginning: (L1, R1), (L2, R2), (L2, R2), (L2, R2).
+ * This rule expands it to 9 sub-joins:
+ * (L1, R1),
+ * (L2-1, R2), (L2-2, R2),
+ * (L3, R3-1), (L3, R3-2),
+ * (L4-1, R4-1), (L4-2, R4-1), (L4-1, R4-2), (L4-2, R4-2)
+ * Each sub-join is executed as a Spark task physically, so we end up with
more parallelism.
+ *
+ * Note that, this rule also coalesces non-skewed partitions like
`ReduceNumShufflePartitions`.
Review comment:
nit: when ... is enabled.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]