GitHub user redsanket opened a pull request:

    https://github.com/apache/spark/pull/22173

    [SPARK-24335] Spark external shuffle server improvement to better handle 
block fetch requests.

    ## What changes were proposed in this pull request?
    
    This is a continuation of https://github.com/apache/spark/pull/21402.
    Since there has been no activity on that PR, I am taking it over; I made a few minor changes and tested them.
    The description from the earlier PR is reproduced below.
    
    Description:
    Right now, the default number of server-side Netty handler threads is 2 * the number of cores, and it can be configured further with the parameter spark.shuffle.io.serverThreads.
    Processing a client request requires one available server Netty handler thread.
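    As a hedged illustration of the default just described, the sketch below computes the handler thread count with a fallback of 2 * available cores when no explicit value is configured; the class and method names are illustrative placeholders, not Spark's actual code.

    ```java
    // Illustrative only: mirrors the described default of 2 * cores when
    // spark.shuffle.io.serverThreads is not set (modeled here as 0).
    public class ServerThreadDefault {
      static int serverThreads(int configuredThreads) {
        return configuredThreads > 0
            ? configuredThreads
            : 2 * Runtime.getRuntime().availableProcessors();
      }

      public static void main(String[] args) {
        System.out.println("default:    " + serverThreads(0));   // unset -> 2 * cores
        System.out.println("configured: " + serverThreads(64));  // explicit value wins
      }
    }
    ```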
    However, when the server Netty handler threads start to process ChunkFetchRequests, they block on disk I/O, mostly due to disk contention from the random read operations initiated by all the ChunkFetchRequests received from clients.
    As a result, when the shuffle server is serving many concurrent ChunkFetchRequests, the server-side Netty handler threads can all end up blocked on reading shuffle files, leaving no handler thread available to process other types of requests, which should all be very quick to handle.
    
    This issue can be fixed by limiting the number of Netty handler threads that can get blocked while processing ChunkFetchRequests. We have a patch that does this by using a separate EventLoopGroup with a dedicated ChannelHandler to process ChunkFetchRequests. This enables the shuffle server to reserve Netty handler threads for non-ChunkFetchRequest messages, ensuring consistent processing time for those requests, which are fast to process. After deploying the patch in our infrastructure, we no longer see timeout issues with either executors registering with the local shuffle server or shuffle clients establishing connections with remote shuffle servers.
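    To illustrate the approach, here is a minimal, hypothetical Netty sketch of the pattern: the chunk fetch handler is registered on its own EventExecutorGroup so that its blocking disk reads cannot starve the main worker event loops. The message and handler class names, the thread split, and the port are illustrative assumptions, not Spark's actual implementation, and a real pipeline would also include protocol decoders in front of these handlers.

    ```java
    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelPipeline;
    import io.netty.channel.SimpleChannelInboundHandler;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.util.concurrent.DefaultEventExecutorGroup;

    public class DedicatedChunkFetchServer {

      // Placeholder message types standing in for the real protocol messages.
      static class ChunkFetchRequest {}
      static class ControlRequest {}

      // Blocking shuffle-file reads happen here, so this handler is kept off
      // the main worker event loops.
      static class ChunkFetchHandler extends SimpleChannelInboundHandler<ChunkFetchRequest> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, ChunkFetchRequest msg) {
          // ... read shuffle blocks from disk and write the response ...
        }
      }

      // Fast control messages (RPCs, registrations) stay on the worker loops.
      static class ControlHandler extends SimpleChannelInboundHandler<ControlRequest> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, ControlRequest msg) {
          // ... handle quick, non-blocking requests ...
        }
      }

      public static void main(String[] args) throws InterruptedException {
        int serverThreads = 2 * Runtime.getRuntime().availableProcessors();
        // Illustrative split: only part of the capacity may block on chunk fetches.
        int chunkFetchThreads = Math.max(1, serverThreads / 2);

        NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);
        NioEventLoopGroup workerGroup = new NioEventLoopGroup(serverThreads);
        DefaultEventExecutorGroup chunkFetchGroup =
            new DefaultEventExecutorGroup(chunkFetchThreads);

        ServerBootstrap bootstrap = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .childHandler(new ChannelInitializer<SocketChannel>() {
              @Override
              protected void initChannel(SocketChannel ch) {
                ChannelPipeline p = ch.pipeline();
                // Runs on the channel's worker event loop.
                p.addLast("controlHandler", new ControlHandler());
                // Runs on the dedicated executor group, so blocking I/O here
                // does not tie up the worker event loops.
                p.addLast(chunkFetchGroup, "chunkFetchHandler", new ChunkFetchHandler());
              }
            });

        // 7337 is the default external shuffle service port; any free port works.
        bootstrap.bind(7337).sync().channel().closeFuture().sync();
      }
    }
    ```

    The key design point is Netty's ChannelPipeline.addLast(EventExecutorGroup, ...) overload, which moves a handler's callbacks off the channel's event loop and onto the given executor group.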
    
    ## How was this patch tested?
    
    Unit tests and stress testing.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/redsanket/spark SPARK-24335

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22173
    
----
commit 44bb55759a4059d8bb0e60c361a8a3210a234f92
Author: Sanket Chintapalli <schintap@...>
Date:   2018-08-21T17:34:31Z

    SPARK-24355 Spark external shuffle server improvement to better handle 
block fetch requests.

commit 3bab74ca84fe1b6682000741b958c8792f792472
Author: Sanket Chintapalli <schintap@...>
Date:   2018-08-21T16:49:50Z

    make chunk fetch handler threads as a percentage of transport server threads

----


---
