[ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649123#action_12649123 ]
Hairong Kuang commented on HADOOP-4659:
---------------------------------------

Hi Steve, I have a few comments on your new patch:

1. I do not think it is right to throw IOException in setupIOstreams, as I commented yesterday. If setupIOstreams does throw, then most of the time the ConnectionException will be thrown at getConnection in call() and will never be delivered to call.error, so wrapping the exception there is of no use.

2. We wrap local exceptions on purpose in HADOOP-2811. The reason we did that is to give the RPC caller the right stack trace. An RPC may fail at the time of setting up a connection, sending the request, or receiving a response, or because of a failure caused by another RPC call. Those errors may occur in different threads, so the stack trace would be confusing to the caller if we did not wrap them.

I want to get waitForProxy working in 0.18.3. Can we agree on a minimal change to make it work? RPC is so fundamental to Hadoop that any minor change may cause unexpected problems, so I am thinking the smaller the change the better. If you do not like the idea of checking the cause of the failure in waitForProxy, I am OK with your idea of having Client.call() wrap a ConnectException as a ConnectException, a SocketTimeoutException as a SocketTimeoutException, and other exceptions as IOException.

As for the JUnit test failure, I cannot reproduce it on my local machine. Can you check why the RPC server is not up, so that the first thread gets stuck in waitForProxy? Could you please tell me on which line of the test it hangs?
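A minimal sketch of the wrapping scheme proposed above, assuming a hypothetical helper (the method name, message text, and address parameter are illustrative, not the actual Hadoop code): re-create the exception as the same concrete type, so callers such as waitForProxy that test for ConnectException or SocketTimeoutException still see the right type, while the new instance carries the caller's stack trace and keeps the remote/connection-thread exception as its cause.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class CallWrapper {

  /**
   * Wrap a failure from Client.call() so the caller gets a stack trace
   * from its own thread, while the exception type and root cause are
   * preserved for code that inspects them. (Illustrative sketch only.)
   */
  public static IOException wrapException(String addr, IOException e) {
    IOException wrapped;
    if (e instanceof ConnectException) {
      // keep the type: callers retrying on connection refusal still match
      wrapped = new ConnectException(
          "Call to " + addr + " failed on connection exception: " + e);
    } else if (e instanceof SocketTimeoutException) {
      wrapped = new SocketTimeoutException(
          "Call to " + addr + " failed on socket timeout exception: " + e);
    } else {
      wrapped = new IOException(
          "Call to " + addr + " failed on local exception: " + e);
    }
    wrapped.initCause(e);  // root cause survives for getCause() checks
    return wrapped;
  }

  public static void main(String[] args) {
    IOException w = CallWrapper.wrapException(
        "host:8020", new ConnectException("Connection refused"));
    System.out.println(w.getClass().getSimpleName());   // ConnectException
    System.out.println(w.getCause().getMessage());       // Connection refused
  }
}
```

With this, a retry loop like waitForProxy needs no special cause-checking: catching ConnectException directly is enough, and the original failure is still reachable via getCause() for logging.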
> Root cause of connection failure is being lost to code that uses it for
> delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch,
> hadoop-4659.patch, rpcConn.patch
>
> In ipc.Client the root cause of a connection failure is being lost as the
> exception is wrapped, hence the outside code, the code that looks for that
> root cause, isn't working as expected. The result is that you can't bring up a
> task tracker before the job tracker, and probably the same for a datanode before
> a namenode. The change that triggered this is not yet located; I had thought
> it was HADOOP-3844 but I no longer believe this is the case.

--
This message is automatically generated by JIRA.