[ 
https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-2563:
-----------------------------------------

    Description: 
In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
to connection timeout exceptions. 

If the connection attempt times out, the socket gets closed and, as the code 
at [1] shows, we get a ClosedChannelException. We should check whether the 
socket was closed due to a timeout and, if so, open a new socket and retry 
the connection. 
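
To make the proposal concrete, here is a minimal sketch of the retry idea in 
Java. It is not Spark's code: the class ReconnectSketch, the method 
connectWithRetry, and the parameters timeoutMillis and maxAttempts are 
hypothetical names chosen for illustration. It also assumes a blocking connect 
through the channel's socket adaptor for brevity, so it only covers the case 
where the timeout surfaces as a SocketTimeoutException. A non-blocking, 
selector-driven connection would need the same logic wired into its own 
connect handling.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SocketChannel;

// Hypothetical helper for illustration only; not part of Spark.
final class ReconnectSketch {

    /**
     * Tries to connect up to maxAttempts times. A channel whose connect
     * attempt timed out is closed and cannot be reused, so every attempt
     * starts with a freshly opened SocketChannel.
     */
    static SocketChannel connectWithRetry(InetSocketAddress address,
                                          int timeoutMillis,
                                          int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Open a new channel for each attempt instead of reusing a closed one.
            SocketChannel channel = SocketChannel.open();
            try {
                // Blocking connect with a timeout via the socket adaptor.
                channel.socket().connect(address, timeoutMillis);
                return channel;
            } catch (SocketTimeoutException e) {
                // Timed out: release this channel and retry with a new one.
                channel.close();
                lastTimeout = e;
            } catch (IOException e) {
                // Any other failure (for example, connection refused) is not retried.
                channel.close();
                throw e;
            }
        }
        // Every attempt timed out; report the last timeout to the caller.
        throw lastTimeout;
    }
}
```

A caller would use something like 
ReconnectSketch.connectWithRetry(new InetSocketAddress(host, port), 5000, 3), 
where host and port are placeholders. Only timeouts are retried here; whether 
other failures should also be retried is a separate design decision.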

FWIW, I was able to work around the problem by increasing the number of SYN 
retries in Linux (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries).

[1] 
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

  was:In a large EC2 cluster, I often see the first shuffle stage in a job fail 
due to connection timeout exceptions. We should make the number of retries 
before failing configurable to handle these cases.

        Summary: Re-open sockets to handle connect timeouts  (was: Make number 
of connection retries configurable)

> Re-open sockets to handle connect timeouts
> ------------------------------------------
>
>                 Key: SPARK-2563
>                 URL: https://issues.apache.org/jira/browse/SPARK-2563
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shivaram Venkataraman
>            Priority: Minor
>
> In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
> to connection timeout exceptions. 
>
> If the connection attempt times out, the socket gets closed and, as the code 
> at [1] shows, we get a ClosedChannelException. We should check whether the 
> socket was closed due to a timeout and, if so, open a new socket and retry 
> the connection. 
>
> FWIW, I was able to work around the problem by increasing the number of SYN 
> retries in Linux (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries).
>
> [1] 
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573



--
This message was sent by Atlassian JIRA
(v6.2#6252)
