c21 commented on a change in pull request #34150:
URL: https://github.com/apache/spark/pull/34150#discussion_r722956944
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -432,8 +432,8 @@ trait JoinSelectionHelper {
* that is much smaller than other one. Since we does not have the statistic for number of rows,
* use the size of bytes here as estimation.
*/
- private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
- a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
+ private def muchSmaller(a: LogicalPlan, b: LogicalPlan, conf: SQLConf): Boolean = {
Review comment:
shall we update the comment above in line 429 as well?
```
Returns whether plan a is much smaller (3X) than plan b.
```
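To illustrate the change under review: a minimal, standalone sketch of the factor-based "much smaller" check, with Spark's `SQLConf` lookup replaced by a plain `factor` parameter (the object and parameter names here are hypothetical, not from the PR):

```scala
// Hypothetical standalone sketch of the configurable "much smaller" check.
// Byte sizes are used as the estimate because row-count statistics are
// unavailable, mirroring the comment in joins.scala.
object MuchSmallerSketch {
  // Returns whether size `a` is much smaller (factor X) than size `b`.
  def muchSmaller(a: BigInt, b: BigInt, factor: Int): Boolean =
    a * factor <= b
}
```

With `factor = 3` this reproduces the previously hard-coded behavior (`a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes`), which is why the reviewer asks to keep the "(3X)" wording in the doc comment in sync with the now-configurable factor.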
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
+ .doc("The shuffle hash join can be selected if the data size of small" +
+ " side multiplied by this factor is still smaller than the large side.")
+ .version("3.2.0")
+ .intConf
+ .checkValue(_ >= 0, "The shuffle hash join factor cannot be negative.")
Review comment:
shouldn't we enforce the value to be larger than 1?
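The reviewer's point can be sketched as follows: a factor below 1 would allow the "small" side to be larger than the other side, so a stricter lower bound makes sense. This is a hypothetical standalone validation, not the actual `ConfigBuilder.checkValue` call:

```scala
// Hypothetical validation mirroring the reviewer's suggestion: enforce a
// lower bound of 1 rather than 0, since a factor of 0 would make every
// plan count as "much smaller".
object FactorValidation {
  def checkFactor(v: Int): Int = {
    require(v >= 1, "The shuffle hash join factor must be at least 1.")
    v
  }
}
```

In the PR itself this would correspond to changing the guard to something like `.checkValue(_ >= 1, ...)` on the config builder.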
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
Review comment:
nit: how about `spark.sql.shuffledHashJoinFactor`? We adhere to
shuffle`d` hash join in config name, e.g. existing one -
`spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold`
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
+ .doc("The shuffle hash join can be selected if the data size of small" +
+ " side multiplied by this factor is still smaller than the large side.")
+ .version("3.2.0")
Review comment:
nit: `3.3.0` now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]