cloud-fan commented on issue #22173: [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests. URL: https://github.com/apache/spark/pull/22173#issuecomment-570509238

We hit a significant performance regression in our internal workload caused by this commit. After this commit, the executor can handle at most N chunk fetch requests at the same time, where N is the value of `spark.shuffle.io.serverThreads` * `spark.shuffle.server.chunkFetchHandlerThreadsPercent`. Previously it was unlimited, and most of the time we could saturate the underlying channel.

This commit does fix a nasty problem, and I'm fine with it even if it may introduce a perf regression, but there should be a way to turn it off. Unfortunately, we can't turn this feature off. We can set `spark.shuffle.server.chunkFetchHandlerThreadsPercent` to a large value so that many chunk fetch requests can be handled at the same time, but it's hard to pick a value that is not too large yet still saturates the channel.

Looking back at this problem, I think we can either create a dedicated channel for non-chunk-fetch requests, or ask Netty to handle channel writes for non-chunk-fetch requests first. Both seem hard to implement. Shall we revert it first, and think of a good fix later?
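For readers following along, the limit described above can be sketched as a small helper. This is illustrative only: the method name is hypothetical, and the exact rounding and defaults in Spark's actual `TransportConf` may differ; the point is simply that the handler pool size, and therefore the concurrency cap N, is derived from the product of the two configs.

```java
// Illustrative sketch (not Spark's actual code) of how the chunk-fetch
// handler pool size N is derived from the two configs discussed above.
public class ChunkFetchThreads {

    // serverThreads  ~ spark.shuffle.io.serverThreads
    // percent        ~ spark.shuffle.server.chunkFetchHandlerThreadsPercent
    static int chunkFetchHandlerThreads(int serverThreads, double percent) {
        // At most N chunk fetch requests are handled concurrently;
        // before the commit this was unbounded.
        return (int) Math.ceil(serverThreads * percent / 100.0);
    }

    public static void main(String[] args) {
        // With 8 server threads and 100%, only 8 concurrent chunk fetches.
        System.out.println(chunkFetchHandlerThreads(8, 100));
        // Raising the percent raises the cap, but picking a value that is
        // "not too large yet still saturates the channel" is the hard part.
        System.out.println(chunkFetchHandlerThreads(8, 400));
    }
}
```

This also shows why the feature cannot truly be disabled: any finite percent still yields a finite N, whereas the pre-commit behavior had no cap at all.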
