[ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649123#action_12649123 ]
Hairong Kuang commented on HADOOP-4659:
---------------------------------------

Hi Steve, I have a few comments on your new patch:

1. I do not think it is right to throw IOException in setupIOstreams, as I commented yesterday. If setupIOstreams does throw, then most of the time the ConnectionException will be thrown at getConnection in call() and will never be delivered to call.error, so wrapping the exception there is of no use.

2. We wrap local exceptions on purpose in HADOOP-2811. The reason we did that is to give the RPC caller the right stack trace. An RPC may fail at the time of setting up a connection, sending the request, or receiving a response, or because of a failure caused by another RPC call. Those errors may occur in different threads, so the stack trace would be confusing to the caller if we did not wrap them.

I want to get waitForProxy working in 0.18.3. Can we agree on a minimal change to make it work? RPC is so fundamental to Hadoop that any minor change may cause unexpected problems, so I am thinking the smaller the change the better. If you do not like the idea of checking the cause of the failure in waitForProxy, I am OK with your idea of having Client.call() wrap a ConnectException as a ConnectException, a SocketTimeoutException as a SocketTimeoutException, and other exceptions as IOException.

As for the JUnit test failure, I cannot reproduce it on my local machine. Can you check why the RPC server is not up, so that the first thread gets stuck in waitForProxy? Could you please tell me on which line of the test it hangs?
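A minimal sketch of the wrapping scheme proposed above, assuming a hypothetical helper (the method name, message text, and address parameter are illustrative, not the actual Hadoop code): re-create the exception as the same concrete type, so callers such as waitForProxy that test for ConnectException or SocketTimeoutException still see the right type, while the new instance carries the caller's stack trace and keeps the remote/connection-thread exception as its cause.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class CallWrapper {

  /**
   * Wrap a failure from Client.call() so the caller gets a stack trace
   * from its own thread, while the exception type and root cause are
   * preserved for code that inspects them. (Illustrative sketch only.)
   */
  public static IOException wrapException(String addr, IOException e) {
    IOException wrapped;
    if (e instanceof ConnectException) {
      // keep the type: callers retrying on connection refusal still match
      wrapped = new ConnectException(
          "Call to " + addr + " failed on connection exception: " + e);
    } else if (e instanceof SocketTimeoutException) {
      wrapped = new SocketTimeoutException(
          "Call to " + addr + " failed on socket timeout exception: " + e);
    } else {
      wrapped = new IOException(
          "Call to " + addr + " failed on local exception: " + e);
    }
    wrapped.initCause(e);  // root cause survives for getCause() checks
    return wrapped;
  }

  public static void main(String[] args) {
    IOException w = CallWrapper.wrapException(
        "host:8020", new ConnectException("Connection refused"));
    System.out.println(w.getClass().getSimpleName());   // ConnectException
    System.out.println(w.getCause().getMessage());       // Connection refused
  }
}
```

With this, a retry loop like waitForProxy needs no special cause-checking: catching ConnectException directly is enough, and the original failure is still reachable via getCause() for logging.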
> Root cause of connection failure is being lost to code that uses it for
> delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch,
> hadoop-4659.patch, rpcConn.patch
>
> In ipc.Client the root cause of a connection failure is being lost as the
> exception is wrapped, hence the outside code, the code that looks for that
> root cause, isn't working as expected. The result is that you can't bring up a
> task tracker before the job tracker, and probably the same for a datanode before
> a namenode. The change that triggered this is not yet located; I had thought
> it was HADOOP-3844 but I no longer believe this is the case.

--
This message is automatically generated by JIRA.