[ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648138#action_12648138 ]
Steve Loughran commented on HADOOP-4659:
----------------------------------------

Raghu,

> why does Client wrap one IOException in another?

I don't know the original reason; HADOOP-3844 retained this feature and included the hostname/port at fault, which is handy for identifying configuration problems. The patch only adds these diagnostics to ConnectExceptions and passes the rest up.

> is this a vanilla 0.18?

I only work with SVN_HEAD; it's present there. If Hairong thinks it came in with HADOOP-2188, then it also exists in 0.18, but that will need a different patch.

> Also, "org.apache.hadoop.ipc.Client.call" does not actually catch exception
> from getConnection() ...

Client.call doesn't catch the exception. The problem is that RPC.waitForProxy does, and it handles ConnectException and SocketTimeoutException by logging, sleeping, and trying again. That retry was not happening while the ConnectException was being downgraded to a plain IOException, so the task tracker was failing if it came up before the job tracker, rather than waiting quietly for the job tracker to come up.
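
To make the failure mode concrete, here is a minimal standalone sketch, not the Hadoop code itself. The names WrapDemo, Connector, waitFor and flaky are hypothetical stand-ins, and the retry loop is bounded only so that the demo terminates. It shows that a catch(ConnectException) clause stops matching once the exception is wrapped in a generic IOException, even though the original is still reachable through getCause().

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical demo class; none of these names exist in Hadoop.
class WrapDemo {
    /** Stand-in for a connection attempt such as getProxy(). */
    interface Connector {
        String connect() throws IOException;
    }

    /**
     * Retry loop in the style of waitForProxy: only ConnectException is
     * retried. Bounded by maxAttempts (an assumption for this sketch).
     */
    static String waitFor(Connector c, int maxAttempts) throws IOException {
        for (int attempt = 1; ; attempt++) {
            try {
                return c.connect();
            } catch (ConnectException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                // The real loop logs and sleeps here before retrying.
            }
        }
    }

    /** Fails for the first `failures` calls, then succeeds. */
    static Connector flaky(int failures, boolean wrap) {
        int[] calls = {0};
        return () -> {
            if (calls[0]++ < failures) {
                ConnectException ce = new ConnectException("Connection refused");
                if (wrap) {
                    // Wrapped: the static type is now plain IOException,
                    // so catch(ConnectException) above never sees it.
                    throw new IOException("Call failed", ce);
                }
                throw ce;
            }
            return "connected";
        };
    }

    public static void main(String[] args) throws IOException {
        // Unwrapped: two failures are retried, the third call succeeds.
        System.out.println(waitFor(flaky(2, false), 5));
        try {
            // Wrapped: the first failure escapes the retry loop.
            waitFor(flaky(2, true), 5);
        } catch (IOException e) {
            System.out.println("gave up, cause: " + e.getCause());
        }
    }
}
```

With wrap set to false the loop retries until the server answers; with wrap set to true the very first failure propagates to the caller, which is the behaviour seen when a task tracker starts before its job tracker.
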
The net effect is a race condition in cluster startup, which makes the cluster more brittle.

Here's where the exceptions get picked up in RPC.java:

  public static VersionedProtocol waitForProxy(Class protocol,
                                               long clientVersion,
                                               InetSocketAddress addr,
                                               Configuration conf
                                               ) throws IOException {
    while (true) {
      try {
        return getProxy(protocol, clientVersion, addr, conf);
      } catch(ConnectException se) {  // namenode has not been started
        LOG.info("Server at " + addr + " not available yet, Zzzzz...");
      } catch(SocketTimeoutException te) {  // namenode is busy
        LOG.info("Problem connecting to server: " + addr);
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        // IGNORE
      }
    }
  }

> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>             Fix For: 0.18.3
>
>         Attachments: hadoop-4659.patch
>
>
> In ipc.Client the root cause of a connection failure is being lost as the
> exception is wrapped, hence the outside code, the one that looks for that
> root cause, isn't working as expected. The result is that you can't bring up a
> task tracker before a job tracker, and probably the same for a datanode before
> a namenode. The change that triggered this is not yet located; I had thought
> it was HADOOP-3844 but I no longer believe this is the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.