cloud-fan commented on code in PR #47533:
URL: https://github.com/apache/spark/pull/47533#discussion_r1700339034
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala:
##########
@@ -170,6 +170,16 @@ object StaticSQLConf {
.intConf
.createWithDefault(1000)
+ val SHUFFLE_EXCHANGE_MAX_THREAD_THRESHOLD =
+ buildStaticConf("spark.sql.shuffleExchange.maxThreadThreshold")
+ .internal()
+ .doc("The maximum degree of parallelism for doing preparation of shuffle exchange, " +
+ "which includes subquery execution, file listing, etc.")
+ .version("4.0.0")
+ .intConf
+ .checkValue(thres => thres > 0 && thres <= 1024, "The threshold must be in (0,1024].")
+ .createWithDefault(1024)
Review Comment:
The shuffle async job just waits for other work (subquery expression execution)
to finish, which is very lightweight. The broadcast async job executes a query
and collects the result on the driver, which is very heavy. That's why we can
give much larger parallelism to the shuffle async jobs. In our benchmark we
found this number is reasonably good for TPC.
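To illustrate the point above, here is a minimal, self-contained sketch (not Spark's actual implementation) of why a large thread cap is safe for tasks that mostly wait: a fixed-size pool bounded by a threshold like the one this config introduces, running many tasks that each block briefly rather than doing heavy CPU work. The object name, the pool size, and the tasks are all hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ShufflePrepPoolSketch {
  // Hypothetical stand-in for spark.sql.shuffleExchange.maxThreadThreshold;
  // the real value is read from StaticSQLConf and capped at 1024.
  val maxThreadThreshold: Int = 1024
  require(maxThreadThreshold > 0 && maxThreadThreshold <= 1024)

  def main(args: Array[String]): Unit = {
    // Shuffle-preparation tasks mostly wait (e.g. for subquery results),
    // so a large pool of mostly-idle threads is cheap.
    val pool = Executors.newFixedThreadPool(math.min(8, maxThreadThreshold))
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Each task simulates "waiting for other work to finish", then returns.
    val futures = (1 to 8).map { i =>
      Future { Thread.sleep(10); i * 2 }
    }
    val results = futures.map(f => Await.result(f, 10.seconds))
    println(results.sum)
    pool.shutdown()
  }
}
```

Since each task blocks rather than burns CPU, the pool size governs how many waits overlap, not how much compute is consumed; that is the rationale for giving shuffle preparation a much larger cap than broadcast jobs, which collect full query results on the driver.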
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]