[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039396#comment-14039396 ]
Charles Reiss commented on SPARK-704:
-------------------------------------
It's been a while since I reported this issue, so it may have been incidentally
fixed.
The problem, though, was a remote node failing _after_ one or more messages
had been sent to it successfully but before a response was received. At that
point there is no outgoing message left whose write to the channel would fail
and reveal the dead connection.
If there's a corresponding ReceivingConnection, then the remote node death
would be detected via a failed read, but I believe the code in
ConnectionManager#removeConnection would not reliably trigger the
MessageStatuses.
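To make the scenario concrete, here is a minimal sketch of the failure mode in Python. It is a simplified model, not Spark's code: `MessageStatus` is only loosely modelled on the ConnectionManager concept of that name, and `PendingAcks`, `fail_all_for`, and the `worker-1:7077` address are hypothetical names made up for illustration. It shows a sender that has already written its message, so only an explicit failure of the outstanding statuses can unblock it once the remote node is gone.

```python
import threading


class MessageStatus:
    """Hypothetical stand-in for the per-message status a sender waits on."""

    def __init__(self, message_id, remote_id):
        self.message_id = message_id
        self.remote_id = remote_id  # (host, port) the message was sent to
        self.acked = False
        self.failed = False
        self._done = threading.Event()

    def mark_acked(self):
        self.acked = True
        self._done.set()

    def mark_failed(self):
        self.failed = True
        self._done.set()

    def wait(self, timeout):
        # Returns False if neither an ack nor a failure arrived in time.
        return self._done.wait(timeout)


class PendingAcks:
    """Hypothetical registry of messages that were sent but not yet acked."""

    def __init__(self):
        self._lock = threading.Lock()
        self._statuses = {}

    def sent(self, status):
        with self._lock:
            self._statuses[status.message_id] = status

    def fail_all_for(self, remote_id):
        """Fail every outstanding message that was sent to remote_id."""
        with self._lock:
            dead = [s for s in self._statuses.values()
                    if s.remote_id == remote_id]
            for s in dead:
                del self._statuses[s.message_id]
        for s in dead:
            s.mark_failed()
        return len(dead)


pending = PendingAcks()
status = MessageStatus(1, ("worker-1", 7077))
pending.sent(status)  # the write itself succeeded

# The remote node dies here. Nothing else is written to the channel, so
# no failed write occurs and the waiter simply times out (or hangs, if it
# waits without a timeout).
timed_out = not status.wait(timeout=0.05)

# Only when connection removal explicitly fails the outstanding statuses
# does the waiter learn about the loss.
failed_count = pending.fail_all_for(("worker-1", 7077))
unblocked = status.wait(timeout=0.05)
```

In this model the hang disappears only if whatever detects the dead connection also calls something like `fail_all_for` with the right remote identity, which is the step I believe was unreliable.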
> ConnectionManager sometimes cannot detect loss of sending connections
> ---------------------------------------------------------------------
>
> Key: SPARK-704
> URL: https://issues.apache.org/jira/browse/SPARK-704
> Project: Spark
> Issue Type: Bug
> Reporter: Charles Reiss
> Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections
> disconnect unless it is trying to send through them. As a result, a node
> failure just after a connection is initiated but before any acknowledgement
> messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the
> failure of a corresponding ReceivingConnection, but this code assumes that
> the remote host:port of the ReceivingConnection is the same as the
> ConnectionManagerId, which is almost never true. Additionally, there does not
> appear to be any reason to assume a corresponding ReceivingConnection will
> exist.
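The host:port mismatch described above can be illustrated with a small Python sketch. This is an assumption-laden model, not the ConnectionManager code: the addresses, port numbers, and the `find_sending_by_exact_address` helper are all hypothetical. The point is only that an inbound connection's remote address normally carries the peer's ephemeral source port, while a ConnectionManagerId-style key carries the port the peer listens on.

```python
# Key in the style of a ConnectionManagerId: the address the remote
# manager listens on (values are made up for illustration).
remote_manager_id = ("10.0.0.5", 7077)

# Sending connections indexed by that listening address.
sending_by_manager_id = {
    remote_manager_id: "SendingConnection to 10.0.0.5:7077",
}

# Remote address as seen on an inbound (receiving) connection from the
# same node: same host, but the source port is an ephemeral one chosen by
# the peer's OS, not the listening port.
receiving_remote_addr = ("10.0.0.5", 51324)


def find_sending_by_exact_address(addr):
    """Hypothetical lookup that assumes addr equals the manager id."""
    return sending_by_manager_id.get(addr)


# The lookup keyed on the receiving connection's remote address misses,
# so the matching sending connection is never cleaned up.
missed = find_sending_by_exact_address(receiving_remote_addr)

# Only the listening address itself finds the sending connection.
found = find_sending_by_exact_address(remote_manager_id)
```

Here `missed` is `None` while `found` is the sending connection, which is why a cleanup path keyed on the ReceivingConnection's remote host:port would rarely fire.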