[jira] Commented: (HADOOP-4659) Root cause of connection failure is being lost to code that uses it for delaying startup

Steve Loughran (JIRA) Mon, 17 Nov 2008 03:21:49 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648138#action_12648138
 ]


Steve Loughran commented on HADOOP-4659:
----------------------------------------

Raghu, 

> why does Client wrap one IOException in another?

I don't know the original reason; HADOOP-3844 retained this feature and added 
the hostname/port at fault, which is handy for identifying configuration 
problems. The patch only adds these diagnostics to ConnectExceptions and passes 
the rest up.
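
As a rough illustration of that approach (not the contents of 
hadoop-4659.patch), the sketch below adds the target host/port to a 
ConnectException while keeping its type, and returns every other IOException 
unchanged. The class name ConnectDiagnostics and the method wrapWithTarget are 
hypothetical, as is the message format.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;

// Hypothetical helper for illustration only; not the actual patch.
final class ConnectDiagnostics {
  private ConnectDiagnostics() {}

  /**
   * Adds the target address to a ConnectException while preserving its type,
   * so callers that catch ConnectException still match it.
   * Any other IOException is passed up unchanged.
   */
  static IOException wrapWithTarget(InetSocketAddress addr, IOException e) {
    if (e instanceof ConnectException) {
      ConnectException wrapped = new ConnectException(
          "Call to " + addr.getHostString() + ":" + addr.getPort()
          + " failed on connection exception: " + e);
      wrapped.initCause(e);  // keep the original exception as the cause
      return wrapped;
    }
    return e;
  }
}
```

The point of the design is that the extra diagnostics live in the message and 
the cause chain, while the exception class that callers test for stays the 
same.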

>is this a vanilla 0.18?

I only work with SVN_HEAD; it's present there. If Hairong thinks it came in 
with HADOOP-2188, then it also exists in 0.18, but that will need a different 
patch. 

> Also , "org.apache.hadoop.ipc.Client.call" does not actually catch exception 
> from getConnection() ...

Client.call doesn't catch the exception. The problem is that RPC.waitForProxy 
does, and it handles ConnectException and SocketTimeoutException by logging, 
sleeping, and trying again. This was not happening when the ConnectException 
was being downgraded by the wrapping, so the task tracker was failing if it 
came up before the job tracker, rather than waiting quietly for the job tracker 
to come up. As a result there is a race condition in cluster startup and the 
cluster is more brittle.

Here's where the exceptions get picked up in RPC.java:

  public static VersionedProtocol waitForProxy(Class protocol,
                                               long clientVersion,
                                               InetSocketAddress addr,
                                               Configuration conf
                                               ) throws IOException {
    while (true) {
      try {
        return getProxy(protocol, clientVersion, addr, conf);
      } catch(ConnectException se) {  // namenode has not been started
        LOG.info("Server at " + addr + " not available yet, Zzzzz...");
      } catch(SocketTimeoutException te) {  // namenode is busy
        LOG.info("Problem connecting to server: " + addr);
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        // IGNORE
      }
    }
  }
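
To make the failure mode concrete, here is a small standalone sketch of the 
same catch-and-retry shape. It is not Hadoop code: RetryDemo, Connector, 
waitFor and failingTwice are hypothetical names, the attempt limit is added so 
the demo terminates, and the sleep is left out. It assumes the wrapping 
produces a plain IOException with the ConnectException as its cause.

```java
import java.io.IOException;
import java.net.ConnectException;

// Standalone illustration; none of these names are Hadoop APIs.
final class RetryDemo {
  interface Connector {
    String connect() throws IOException;
  }

  // Retries on ConnectException only; any other IOException escapes at once.
  static String waitFor(Connector c, int maxAttempts) throws IOException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return c.connect();
      } catch (ConnectException ce) {
        // server not available yet: fall through and try again
      }
    }
    throw new IOException("gave up after " + maxAttempts + " attempts");
  }

  // Fails on the first two calls, then succeeds. If wrap is true, the
  // ConnectException is hidden inside a plain IOException.
  static Connector failingTwice(final boolean wrap) {
    final int[] calls = {0};
    return new Connector() {
      public String connect() throws IOException {
        if (++calls[0] <= 2) {
          ConnectException root = new ConnectException("Connection refused");
          if (wrap) {
            // The type is lost here, so catch (ConnectException) no longer matches.
            throw new IOException("Call failed", root);
          }
          throw root;
        }
        return "connected";
      }
    };
  }

  public static void main(String[] args) {
    try {
      System.out.println("unwrapped: " + waitFor(failingTwice(false), 5));
      System.out.println("wrapped: " + waitFor(failingTwice(true), 5));
    } catch (IOException e) {
      System.out.println("wrapped call escaped the retry loop: " + e);
    }
  }
}
```

With the unwrapped exception the loop retries until the connection succeeds; 
with the wrapped one the first failure propagates straight out of the loop, 
which is what the task tracker was hitting.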



> Root cause of connection failure is being lost to code that uses it for 
> delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>             Fix For: 0.18.3
>
>         Attachments: hadoop-4659.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the 
> exception is wrapped, hence the outside code, the one that looks for that 
> root cause, isn't working as expected. The result is that you can't bring up 
> a task tracker before the job tracker, and probably the same for a datanode 
> before a namenode. The change that triggered this has not yet been located; I 
> had thought it was HADOOP-3844 but I no longer believe this is the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
