[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041091#comment-14041091
 ] 

Henry Saputra commented on SPARK-704:
-------------------------------------

Thanks a lot to [~woggle] and [~mridulm80] for clarifying the issue and adding 
comments that help make clear what is happening.

Yes, since the NIO channel for a SendingConnection listens for both write and 
read events (see the for-loop in ConnectionManager), any lost connection 
will be detected through the SendingConnection's channel.

My concern is the "hang" that Charles mentioned in the issue description. I 
tried to reproduce it by shutting down the node manually but could not get 
into that situation.
Since this is async IO, there is no way to learn about the failure of a remote 
node when there is no activity on the socket, other than by sending keepalive 
messages, as Mridul mentioned.
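
To illustrate the read-interest point, here is a minimal standalone sketch 
using plain java.nio. It is not Spark's ConnectionManager code, and the class 
and method names are made up for the example. It assumes an orderly close by 
the peer: a channel registered for OP_READ becomes readable and read() 
returns -1. A node that disappears without closing its socket would produce 
no such event on an idle connection, which is where keepalives come in.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReadInterestDetectsClose {

    /** Returns true if the selector surfaced the peer's close as end-of-stream. */
    static boolean detectsPeerClose() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open()) {
            // Listen on an ephemeral loopback port.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));

            // "sender" stands in for the local end of a sending connection.
            SocketChannel sender = SocketChannel.open(server.getLocalAddress());
            SocketChannel remote = server.accept();

            // Register read interest even though this side only sends.
            sender.configureBlocking(false);
            sender.register(selector, SelectionKey.OP_READ);

            // Simulate the remote side going away with an orderly close.
            remote.close();

            boolean closed = false;
            ByteBuffer buf = ByteBuffer.allocate(16);
            if (selector.select(5000) > 0) {
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        // read() returning -1 means the peer closed the stream.
                        closed = ((SocketChannel) key.channel()).read(buf) == -1;
                    }
                }
            }
            sender.close();
            return closed;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("peer close detected: " + detectsPeerClose());
    }
}
```

The sketch only covers the case where the remote end actually closes the 
socket; it says nothing about how ConnectionManager handles the event.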

> ConnectionManager sometimes cannot detect loss of sending connections
> ---------------------------------------------------------------------
>
>                 Key: SPARK-704
>                 URL: https://issues.apache.org/jira/browse/SPARK-704
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Charles Reiss
>            Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections 
> disconnect except if it is trying to send through them. As a result, a node 
> failure just after a connection is initiated but before any acknowledgement 
> messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the 
> failure of a corresponding ReceivingConnection, but this code assumes that 
> the remote host:port of the ReceivingConnection is the same as the 
> ConnectionManagerId, which is almost never true. Additionally, there does not 
> appear to be any reason to assume a corresponding ReceivingConnection will 
> exist.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
