[GitHub] [spark] maropu commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

GitBox Thu, 16 Jul 2020 18:14:24 -0700


maropu commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456163594




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2659,12 +2660,24 @@ object SQLConf {
     buildConf("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio")
       .doc("The ratio of the number of two buckets being coalesced should be less than or " +
         "equal to this value for bucket coalescing to be applied. This configuration only " +
-        s"has an effect when '${COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_ENABLED.key}' is set to true.")
+        s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")
       .version("3.1.0")
       .intConf
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
 
+  val COALESCE_BUCKETS_IN_SHUFFLED_HASH_JOIN_MAX_BUCKET_RATIO =
+    buildConf("spark.sql.bucketing.coalesceBucketsInShuffledHashJoin.maxBucketRatio")
+      .doc("The ratio of the number of two buckets being coalesced should be less than or " +
+        "equal to this value for bucket coalescing to be applied. This configuration only " +
+        s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true. " +
+        "Note as coalescing reduces parallelism, there might be a higher risk for " +
+        "out of memory error at shuffled hash join build side.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The difference must be positive.")
+      .createWithDefault(2)

Review comment:
       If we use a single shared config for bucket coalescing and a user sets a 
higher value for it, SMJ will likely perform better, but SHJ will likely get 
less parallelism (and then possibly OOM?). If a plan scans many bucketed tables 
and contains both SMJ and SHJ, I personally think it is hard to control the 
coalescing mechanism with a single shared config. WDYT, @viirya? I suggested 
this new config after reading [your comment about 
OOM](https://github.com/apache/spark/pull/29079#pullrequestreview-446923677). 
If you think you don't need this config, removing it is okay with me.
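
For illustration only, here is a minimal Scala sketch of the ratio check under discussion. `BucketCoalesceSketch` and `canCoalesce` are hypothetical names, not code from this PR, and the divisibility requirement is an assumption about how coalescing is gated. The two default ratios (4 and 2) come from the diff above.

```scala
// Hypothetical helper, not the PR's implementation: decides whether two
// bucketed tables could be coalesced under a given maximum bucket ratio.
object BucketCoalesceSketch {
  // Defaults as proposed in the diff above.
  val SortMergeJoinMaxRatio = 4
  val ShuffledHashJoinMaxRatio = 2

  def canCoalesce(numBuckets1: Int, numBuckets2: Int, maxRatio: Int): Boolean = {
    require(numBuckets1 > 0 && numBuckets2 > 0 && maxRatio > 0,
      "bucket counts and ratio must be positive")
    val small = math.min(numBuckets1, numBuckets2)
    val large = math.max(numBuckets1, numBuckets2)
    // Assumption: the bucket counts must differ and the larger count must be
    // an exact multiple of the smaller one before the ratio limit is applied.
    small != large && large % small == 0 && large / small <= maxRatio
  }
}
```

With 8 and 2 buckets the ratio is 4, so under these defaults the pair would be coalesced for SMJ but left alone for SHJ. A single shared config cannot express that difference.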




