GitHub user redsanket opened a pull request: https://github.com/apache/spark/pull/22173
[SPARK-24335] Spark external shuffle server improvement to better handle block fetch requests.

## What changes were proposed in this pull request?

This is a continuation of https://github.com/apache/spark/pull/21402. Since there has been no activity on that PR, I am willing to take it over; I have made a few minor changes and tested them. The description from the earlier PR follows.

Description:

Right now, the default number of server-side Netty handler threads is 2 * the number of cores, and it can be further configured with the parameter spark.shuffle.io.serverThreads. Processing a client request requires one available server Netty handler thread. However, when the server Netty handler threads start to process ChunkFetchRequests, they block on disk I/O, mostly because of disk contention from the random read operations initiated by all the ChunkFetchRequests received from clients. As a result, when the shuffle server is serving many concurrent ChunkFetchRequests, the server-side Netty handler threads can all be blocked reading shuffle files, leaving no handler thread available to process other types of requests, all of which should be very quick to process.

This issue could potentially be fixed by limiting the number of Netty handler threads that can be blocked while processing ChunkFetchRequests. We have a patch that does this by using a separate EventLoopGroup with a dedicated ChannelHandler to process ChunkFetchRequests. This lets the shuffle server reserve Netty handler threads for non-ChunkFetchRequest messages, giving consistent processing times for these requests, which are fast to process. After deploying the patch in our infrastructure, we no longer see timeout issues either when executors register with the local shuffle server or when shuffle clients establish connections with a remote shuffle server.

## How was this patch tested?

Unit tests and stress testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/redsanket/spark SPARK-24335

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22173

----
commit 44bb55759a4059d8bb0e60c361a8a3210a234f92
Author: Sanket Chintapalli <schintap@...>
Date:   2018-08-21T17:34:31Z

    SPARK-24355 Spark external shuffle server improvement to better handle block fetch requests.

commit 3bab74ca84fe1b6682000741b958c8792f792472
Author: Sanket Chintapalli <schintap@...>
Date:   2018-08-21T16:49:50Z

    make chunk fetch handler threads as a percentage of transport server threads

----
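For readers who want a feel for the idea in the description, the following is a minimal sketch, not the code in this patch. It uses plain java.util.concurrent thread pools in place of Netty EventLoopGroups. The class name, the method names, the sizing rule (a percentage of the server threads, rounded up, at least one) and the simulated "disk gate" are all hypothetical and exist only for illustration. It compares a server whose handler threads perform the blocking reads themselves with one that hands chunk fetches to a smaller dedicated pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ChunkFetchSketch {

    // Hypothetical sizing rule: a percentage of the server threads, rounded up, at least 1.
    static int chunkFetchThreads(int serverThreads, double percent) {
        if (serverThreads <= 0 || percent <= 0 || percent > 100) {
            throw new IllegalArgumentException("need serverThreads > 0 and 0 < percent <= 100");
        }
        return Math.max(1, (int) Math.ceil(serverThreads * percent / 100.0));
    }

    // Returns true if a fast request is answered while every chunk fetch is stuck on "disk".
    static boolean fastRequestServed(int serverThreads, double percent, boolean dedicatedFetchPool)
            throws InterruptedException, ExecutionException {
        ExecutorService serverPool = Executors.newFixedThreadPool(serverThreads);
        ExecutorService fetchPool =
            Executors.newFixedThreadPool(chunkFetchThreads(serverThreads, percent));
        CountDownLatch diskGate = new CountDownLatch(1); // stays closed: simulates slow disk reads
        Callable<Void> blockedRead = () -> { diskGate.await(); return null; };
        try {
            // Enough chunk fetches to occupy every server thread if they are not offloaded.
            for (int i = 0; i < serverThreads * 2; i++) {
                if (dedicatedFetchPool) {
                    // The server thread only hands the request off and is free again at once.
                    serverPool.execute(() -> fetchPool.submit(blockedRead));
                } else {
                    // The server thread itself blocks on the read.
                    serverPool.submit(blockedRead);
                }
            }
            // A request that is cheap to process, such as an executor registration.
            Future<String> fast = serverPool.submit(() -> "registered");
            try {
                return "registered".equals(fast.get(500, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                return false; // no server thread was free within the timeout
            }
        } finally {
            diskGate.countDown(); // release the simulated reads so both pools can drain
            serverPool.shutdown();
            serverPool.awaitTermination(5, TimeUnit.SECONDS);
            fetchPool.shutdown();
            fetchPool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:    " + fastRequestServed(4, 50.0, false));
        System.out.println("dedicated pool: " + fastRequestServed(4, 50.0, true));
    }
}
```

With the shared pool, the fast request queues behind blocked reads and the call is expected to time out; with the dedicated pool, the server threads only enqueue the fetches, so the fast request should be answered promptly. The real patch works at the Netty level, so its threading and back-pressure behaviour will differ from this simplified model.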