[ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hairong Kuang updated HADOOP-4659:
----------------------------------

    Attachment: rpcConn.patch

This patch checks the cause of the failure when an RPC client fails to connect to an RPC server, and retries if the failure is caused by an unavailable or busy server. It adds a new static method, waitForProxy, which takes a timeout, mainly for testing purposes. A unit test is added to TestRPC to make sure that the client retries. A manual test was also conducted: starting a DataNode without starting the NameNode causes the DataNode to retry.

Steve, could you please review and test the patch in your setup? I would appreciate any feedback.

> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch, rpcConn.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the
> exception is wrapped, hence the outside code, the one that looks for that
> root cause, isn't working as expected. The result is that you can't bring up
> a task tracker before the job tracker, and probably the same for a datanode
> before a namenode. The change that triggered this is not yet located; I had
> thought it was HADOOP-3844, but I no longer believe this is the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
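
For readers following the issue, the idea of checking the root cause of a wrapped connection failure and retrying only when the server looks unavailable or busy can be sketched roughly as below. This is not the code in rpcConn.patch. The class RetryOnCause and the methods isRetriable and waitFor are hypothetical names made up for illustration, the choice of ConnectException and SocketTimeoutException as "retriable" causes is an assumption, and the sketch uses Java 8 lambdas for brevity.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not taken from Hadoop's ipc.Client or RPC.
class RetryOnCause {

    /** Walks the cause chain looking for a failure that suggests the server is down or busy. */
    static boolean isRetriable(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
            if (t instanceof ConnectException || t instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    /** Calls connect until it succeeds, fails for a non-retriable reason, or the timeout expires. */
    static <T> T waitFor(Callable<T> connect, long timeoutMs, long sleepMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                return connect.call();
            } catch (IOException e) {
                // Give up if the root cause is not a connection problem or time has run out.
                if (!isRetriable(e) || System.currentTimeMillis() >= deadline) {
                    throw e;
                }
            }
            Thread.sleep(sleepMs);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulated server that refuses the first two connection attempts.
        String result = waitFor(() -> {
            if (++attempts[0] < 3) {
                throw new IOException("Call failed", new ConnectException("Connection refused"));
            }
            return "connected";
        }, 5000, 10);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

The point of the sketch is that the retry decision inspects the whole cause chain, so wrapping the original exception in a more general IOException does not hide the connection failure from the code that delays startup.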