Josh Rosen created SPARK-9328:
---------------------------------

             Summary: Netty IO layer should implement read timeouts
                 Key: SPARK-9328
                 URL: https://issues.apache.org/jira/browse/SPARK-9328
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
            Reporter: Josh Rosen
            Assignee: Josh Rosen


Spark's network layer does not implement read timeouts, which can lead to 
whole-job stalls during shuffle: if a remote shuffle server stalls while 
responding to a shuffle block fetch request but does not close the socket, 
then the job may block until an OS-level socket timeout occurs.
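To illustrate the failure mode with plain JDK sockets (a minimal sketch, not Spark's actual network code): a server that accepts a connection but never responds, plus a client read with no timeout, blocks indefinitely; setting a read timeout (here via {{Socket.setSoTimeout}}) makes the read fail fast instead.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    // Returns true if a read against a stalled server hits the read timeout.
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            // Server thread accepts the connection but never writes a byte,
            // simulating a shuffle server that stalls without closing the socket.
            Thread stalled = new Thread(() -> {
                try (Socket s = server.accept()) {
                    Thread.sleep(timeoutMillis * 10L); // stall, keep socket open
                } catch (Exception ignored) { }
            });
            stalled.setDaemon(true);
            stalled.start();

            try (Socket client = new Socket()) {
                client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
                client.setSoTimeout(timeoutMillis); // the read timeout under test
                try {
                    client.getInputStream().read(); // would block indefinitely otherwise
                    return false;
                } catch (SocketTimeoutException e) {
                    return true;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(500) ? "read timed out" : "read completed");
        // prints "read timed out"
    }
}
```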

I think that we can fix this using Netty's ReadTimeoutHandler 
(http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
  The tricky parts of working on this will be figuring out the right place in 
the pipeline to add the handler and ensuring that we don't introduce 
performance regressions by failing to re-use sockets.
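A minimal sketch of how the handler might be wired in, assuming Netty 4's io.netty.handler.timeout.ReadTimeoutHandler; the handler raises a ReadTimeoutException when no data is read within the configured period, which the downstream handler could translate into failing the outstanding fetches. The 120-second value and the class/method names here are illustrative placeholders, not Spark's actual wiring:

```java
import java.util.concurrent.TimeUnit;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.ReadTimeoutHandler;

public class TimeoutPipelineSketch {
  // Configure a client Bootstrap so every new channel gets a read timeout
  // at the front of its pipeline. Illustrative sketch only.
  static void configure(Bootstrap bootstrap) {
    bootstrap.handler(new ChannelInitializer<SocketChannel>() {
      @Override
      protected void initChannel(SocketChannel ch) {
        ch.pipeline().addLast("readTimeout",
            new ReadTimeoutHandler(120, TimeUnit.SECONDS));
        // ... the existing encoder/decoder and response handler go after this
      }
    });
  }
}
```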

Quoting from that linked StackOverflow question:

{quote}
Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
request - it only cares whether data has been read from the socket. If your 
connection is persistent, and you only want read timeouts to fire when a 
request has been sent, you'll need to build a request / response aware timeout 
handler.
{quote}

If we want to avoid tearing down connections between shuffles then we may have 
to build such a request / response aware timeout handler ourselves.
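One way the request/response-aware policy could look, sketched here with plain JDK classes rather than Netty APIs (class and method names are hypothetical): arm the timeout only while requests are outstanding, so an idle persistent connection is never torn down.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a request/response-aware timeout policy: the read timeout only
// applies while at least one request is outstanding, so idle persistent
// connections survive. Names are illustrative, not Spark APIs.
public class RequestAwareTimeout {
    private final AtomicInteger outstanding = new AtomicInteger();
    private volatile long lastActivityNanos = System.nanoTime();
    private final long timeoutNanos;

    public RequestAwareTimeout(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    public void requestSent()      { outstanding.incrementAndGet(); touch(); }
    public void responseReceived() { outstanding.decrementAndGet(); touch(); }
    private void touch()           { lastActivityNanos = System.nanoTime(); }

    // Called periodically (e.g. from a scheduled task): true means the
    // connection should be closed and its outstanding fetches failed.
    public boolean isTimedOut() {
        return outstanding.get() > 0
            && System.nanoTime() - lastActivityNanos > timeoutNanos;
    }
}
```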



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
