maropu commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456163594
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2659,12 +2660,24 @@ object SQLConf {
buildConf("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio")
.doc("The ratio of the number of two buckets being coalesced should be less than or " +
"equal to this value for bucket coalescing to be applied. This configuration only " +
- s"has an effect when '${COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_ENABLED.key}' is set to true.")
+ s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")
.version("3.1.0")
.intConf
.checkValue(_ > 0, "The difference must be positive.")
.createWithDefault(4)
+ val COALESCE_BUCKETS_IN_SHUFFLED_HASH_JOIN_MAX_BUCKET_RATIO =
+ buildConf("spark.sql.bucketing.coalesceBucketsInShuffledHashJoin.maxBucketRatio")
+ .doc("The ratio of the number of two buckets being coalesced should be less than or " +
+ "equal to this value for bucket coalescing to be applied. This configuration only " +
+ s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true. " +
+ "Note as coalescing reduces parallelism, there might be a higher risk for " +
+ "out of memory error at shuffled hash join build side.")
+ .version("3.1.0")
+ .intConf
+ .checkValue(_ > 0, "The difference must be positive.")
+ .createWithDefault(2)
Review comment:
If we use a single shared config for bucket coalescing and a user sets a
higher value for it, SMJ will likely perform better, but SHJ will likely
get less parallelism (and then possibly OOM?). If a plan scans many bucketed
tables and contains both SMJ and SHJ, I personally think it is hard to control
this coalescing mechanism with a single shared config. WDYT, @viirya? I
suggested this new config after reading [your comment about
OOM](https://github.com/apache/spark/pull/29079#pullrequestreview-446923677).
If you think you don't need this config, removing it is okay with me.
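
To make the concern concrete, here is a minimal standalone sketch of how separate
per-join-type ratios might gate coalescing. It does not depend on Spark, and every
name in it (`BucketCoalesceSketch`, `CoalesceLimits`, `shouldCoalesce`) is
hypothetical rather than taken from the PR. The eligibility rule (different bucket
counts, the larger divisible by the smaller, ratio within the limit) is an
assumption about how the check would work; the defaults of 4 and 2 mirror the
values in the diff above.

```scala
// Hypothetical sketch only; none of these names exist in Spark or in the PR.
object BucketCoalesceSketch {
  sealed trait JoinKind
  case object SortMergeJoin extends JoinKind
  case object ShuffledHashJoin extends JoinKind

  // Per-join-type limits; the defaults mirror the diff (4 for SMJ, 2 for SHJ).
  final case class CoalesceLimits(smjMaxRatio: Int = 4, shjMaxRatio: Int = 2) {
    def maxRatio(kind: JoinKind): Int = kind match {
      case SortMergeJoin    => smjMaxRatio
      case ShuffledHashJoin => shjMaxRatio
    }
  }

  // Assumed rule: coalesce only when the bucket counts differ, the larger
  // count is a multiple of the smaller, and their ratio is within the limit
  // configured for this join type.
  def shouldCoalesce(
      kind: JoinKind,
      numBuckets1: Int,
      numBuckets2: Int,
      limits: CoalesceLimits): Boolean = {
    val large = math.max(numBuckets1, numBuckets2)
    val small = math.min(numBuckets1, numBuckets2)
    small > 0 && large != small && large % small == 0 &&
      large / small <= limits.maxRatio(kind)
  }
}
```

Under these assumptions, joining tables with 8 and 2 buckets (ratio 4) would be
coalesced for SMJ but not for SHJ, so the SHJ build side keeps its parallelism.
With a single shared limit of 4, both joins would be coalesced.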