Josh Rosen created SPARK-9328:
---------------------------------
Summary: Netty IO layer should implement read timeouts
Key: SPARK-9328
URL: https://issues.apache.org/jira/browse/SPARK-9328
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen
Spark's network layer does not implement read timeouts, which can lead to
whole-job stalls during shuffle: if a remote shuffle server stalls while
responding to a shuffle block fetch request but does not close the socket, then
the job may block until an OS-level socket timeout occurs.
I think that we can fix this using Netty's ReadTimeoutHandler
(http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
The tricky part of working on this will be figuring out the right place to
add the handler and ensuring that we don't introduce performance issues by
failing to re-use sockets.
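The failure mode is easy to reproduce with plain JDK sockets; Netty's ReadTimeoutHandler raises an analogous ReadTimeoutException inside the channel pipeline when no bytes arrive within the configured window. A self-contained sketch (illustrative only, not Spark's actual shuffle code):

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A "stalled shuffle server": accepts the connection but never responds.
        ServerSocket server = new ServerSocket(0);
        Thread stalled = new Thread(() -> {
            try (Socket ignored = server.accept()) {
                Thread.sleep(10_000); // hold the socket open without writing a byte
            } catch (Exception e) { /* shutting down */ }
        });
        stalled.setDaemon(true);
        stalled.start();

        try (Socket client = new Socket("localhost", server.getLocalPort())) {
            // Without a read timeout, read() blocks until an OS-level timeout, if ever.
            client.setSoTimeout(200); // milliseconds
            client.getInputStream().read();
            System.out.println("read returned");
        } catch (SocketTimeoutException e) {
            // This is the behavior ReadTimeoutHandler would provide at the Netty level.
            System.out.println("read timed out");
        } finally {
            server.close();
        }
    }
}
```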
Quoting from that linked StackOverflow question:
{quote}
Note that the ReadTimeoutHandler is also unaware of whether you have sent a
request - it only cares whether data has been read from the socket. If your
connection is persistent, and you only want read timeouts to fire when a
request has been sent, you'll need to build a request / response aware timeout
handler.
{quote}
If we want to avoid tearing down connections between shuffles then we may have
to do something like this.
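One way to build such a request/response-aware timeout (a minimal sketch, assuming a map of outstanding request IDs; names are hypothetical and this is not Spark's TransportChannelHandler): arm a timer only when a request goes out, and cancel it when the matching response arrives, so an idle persistent connection never times out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative request/response-aware timeout: the timer runs only while a
// request is outstanding, so idle persistent connections are left alone.
public class RequestAwareTimeout {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final Map<Long, ScheduledFuture<?>> outstanding = new ConcurrentHashMap<>();
    private volatile boolean timedOut = false;

    // Call when a fetch request is written to the channel.
    public void requestSent(long requestId, long timeoutMillis) {
        outstanding.put(requestId, timer.schedule(() -> {
            outstanding.remove(requestId);
            timedOut = true; // real code would fail the fetch and close the channel
        }, timeoutMillis, TimeUnit.MILLISECONDS));
    }

    // Call when the matching response arrives; cancels the pending timeout.
    public void responseReceived(long requestId) {
        ScheduledFuture<?> pending = outstanding.remove(requestId);
        if (pending != null) pending.cancel(false);
    }

    public boolean hasTimedOut() { return timedOut; }
    public void shutdown() { timer.shutdownNow(); }
}
```

With this shape, a response that arrives in time cancels its timer, while an unanswered request eventually trips the timeout, which is the distinction the quoted answer says a bare ReadTimeoutHandler cannot make.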
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]