c21 commented on a change in pull request #34150:
URL: https://github.com/apache/spark/pull/34150#discussion_r722956944
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -432,8 +432,8 @@ trait JoinSelectionHelper {
* that is much smaller than other one. Since we does not have the statistic for number of rows,
* use the size of bytes here as estimation.
*/
- private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
- a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
+ private def muchSmaller(a: LogicalPlan, b: LogicalPlan, conf: SQLConf): Boolean = {
Review comment:
shall we update the comment above in line 429 as well?
```
Returns whether plan a is much smaller (3X) than plan b.
```
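To illustrate the change under review: a minimal, standalone sketch of the factor-based "much smaller" check, with Spark's `SQLConf` lookup replaced by a plain `factor` parameter (the object and parameter names here are hypothetical, not from the PR):

```scala
// Hypothetical standalone sketch of the configurable "much smaller" check.
// Byte sizes are used as the estimate because row-count statistics are
// unavailable, mirroring the comment in joins.scala.
object MuchSmallerSketch {
  // Returns whether size `a` is much smaller (factor X) than size `b`.
  def muchSmaller(a: BigInt, b: BigInt, factor: Int): Boolean =
    a * factor <= b
}
```

With `factor = 3` this reproduces the previously hard-coded behavior (`a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes`), which is why the reviewer asks to keep the "(3X)" wording in the doc comment in sync with the now-configurable factor.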
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
+ .doc("The shuffle hash join can be selected if the data size of small" +
+ " side multiplied by this factor is still smaller than the large side.")
+ .version("3.2.0")
+ .intConf
+ .checkValue(_ >= 0, "The shuffle hash join factor cannot be negative.")
Review comment:
shouldn't we enforce the value to be larger than 1?
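The reviewer's point can be sketched as follows: a factor below 1 would allow the "small" side to be larger than the other side, so a stricter lower bound makes sense. This is a hypothetical standalone validation, not the actual `ConfigBuilder.checkValue` call:

```scala
// Hypothetical validation mirroring the reviewer's suggestion: enforce a
// lower bound of 1 rather than 0, since a factor of 0 would make every
// plan count as "much smaller".
object FactorValidation {
  def checkFactor(v: Int): Int = {
    require(v >= 1, "The shuffle hash join factor must be at least 1.")
    v
  }
}
```

In the PR itself this would correspond to changing the guard to something like `.checkValue(_ >= 1, ...)` on the config builder.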
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
Review comment:
nit: how about `spark.sql.shuffledHashJoinFactor`? We adhere to
shuffle`d` hash join in config name, e.g. existing one -
`spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold`
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -416,6 +416,14 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val SHUFFLE_HASH_JOIN_FACTOR = buildConf("spark.sql.shuffleHashJoinFactor")
+ .doc("The shuffle hash join can be selected if the data size of small" +
+ " side multiplied by this factor is still smaller than the large side.")
+ .version("3.2.0")
Review comment:
nit: `3.3.0` now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]