[ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9328:
------------------------------
    Affects Version/s: 1.5.0
                       1.2.1
                       1.3.1
                       1.4.1
     Target Version/s: 1.5.0
             Priority: Blocker  (was: Major)

Marking this as a 1.5.0 blocker so that we remember to fix it.  I've opened a 
pull request to highlight the areas of the code involved in the fix, but I 
may not have time to address this before the 1.5 feature freeze.  If someone 
wants to collaborate with me on this issue, just let me know and I'd be happy 
to sync up and discuss.

> Netty IO layer should implement read timeouts
> ---------------------------------------------
>
>                 Key: SPARK-9328
>                 URL: https://issues.apache.org/jira/browse/SPARK-9328
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.1, 1.3.1, 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Spark's network layer does not implement read timeouts, which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket, then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
> The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> closing sockets that could otherwise have been re-used.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles, then we may 
> have to build a request / response aware timeout handler along those lines.
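
To make the "request / response aware" idea concrete, here is a minimal 
sketch of the bookkeeping such a handler might keep. It is plain Java with no 
Netty dependency, and the class and method names (RequestAwareReadTimeout, 
onRequestSent, onDataRead, onResponseCompleted, isTimedOut) are hypothetical, 
not taken from Spark or Netty. It assumes the caller supplies monotonic 
timestamps (for example from System.nanoTime()) and polls isTimedOut() 
periodically; wiring this into a channel pipeline is left out.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: reports a read timeout only while at least one
 * request is outstanding, so an idle persistent connection is never
 * considered timed out.
 */
class RequestAwareReadTimeout {
    private final long timeoutNanos;
    private int outstandingRequests = 0;
    private long lastActivityNanos = 0L;

    RequestAwareReadTimeout(long timeout, TimeUnit unit) {
        this.timeoutNanos = unit.toNanos(timeout);
    }

    /** Call when a request is written to the connection. */
    synchronized void onRequestSent(long nowNanos) {
        if (outstandingRequests == 0) {
            // The connection was idle: start the clock from this request.
            lastActivityNanos = nowNanos;
        }
        outstandingRequests++;
    }

    /** Call whenever any bytes are read from the connection. */
    synchronized void onDataRead(long nowNanos) {
        lastActivityNanos = nowNanos;
    }

    /** Call when a complete response has been received. */
    synchronized void onResponseCompleted(long nowNanos) {
        if (outstandingRequests > 0) {
            outstandingRequests--;
        }
        lastActivityNanos = nowNanos;
    }

    /** True only if a response is pending and nothing was read for too long. */
    synchronized boolean isTimedOut(long nowNanos) {
        return outstandingRequests > 0
            && nowNanos - lastActivityNanos > timeoutNanos;
    }
}
```

With this bookkeeping, a connection that sits idle between shuffles is never 
flagged, while a fetch that stops receiving bytes for longer than the timeout 
is. What to do once isTimedOut() returns true (fail the outstanding fetches, 
close the socket, or both) is a separate design decision not covered here.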



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
