[ 
https://issues.apache.org/jira/browse/HADOOP-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624365#action_12624365
 ] 

Steve Loughran commented on HADOOP-3987:
----------------------------------------

Here are the specific problems that appear to exist, though I could of course be 
mistaken.

1. All RemoteExceptions that do not wrap a DisallowedTaskTrackerException 
are ignored; nothing is even logged.

        return State.STALE;
      } catch (RemoteException re) {
        String reClass = re.getClassName();
        if (DisallowedTaskTrackerException.class.getName().equals(reClass)) {
          LOG.info("Tasktracker disallowed by JobTracker.");
          return State.DENIED;
        }

2. All IOExceptions are logged, but not reported up; the service remains in the 
inner (sleepless) while loop.

      } catch (IOException except) {
        String msg = "Caught exception: " + except.getMessage();
        LOG.error(msg, except);
      }
    }

Retrying immediately, with no delay, may not be the best way to handle 
network and IO problems.

3. The code that checks for the system directory off the job service will throw 
an IOException if none is provided, an exception that will be caught and 
logged by the code in (2).  If a JobTracker is returning null to 
getSystemDir(), then every TaskTracker that is bonded to it is going to spin, 
calling getSystemDir() on the server, logging the error and repeating, without 
any delay at all.

I'm not sure what the ideal exception handling policy here should be, but what 
is there today has weaknesses. If the network is playing up, logging 
RemoteExceptions and maybe inserting delays would be good; if 
JobTracker.getSystemDir() is null then the clients should sleep longer in the 
hope that someone will fix the job tracker, rather than spinning.
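
To make that suggestion concrete, here is a minimal sketch of one possible 
delay policy. It is not the TaskTracker code and not a proposed patch: the 
class OfferServiceBackoffSketch, the MissingSystemDirException type, the 
delayFor() method and all three delay constants are hypothetical names and 
values chosen for illustration. It assumes the caller counts consecutive 
failures and sleeps for the returned time before retrying.

```java
import java.io.IOException;

// Hypothetical sketch: none of these names or values come from Hadoop.
final class OfferServiceBackoffSketch {

    /** Hypothetical stand-in for "getSystemDir() returned null". */
    static final class MissingSystemDirException extends IOException {
        MissingSystemDirException(String message) {
            super(message);
        }
    }

    // Assumed values, picked only to make the example readable.
    static final long BASE_DELAY_MS = 1_000L;
    static final long MAX_DELAY_MS = 60_000L;
    static final long MISCONFIGURED_DELAY_MS = 300_000L;

    /**
     * Returns how long to sleep before the next attempt.
     * A missing system directory gets a long fixed delay, since retrying
     * quickly cannot help until someone fixes the server. Any other
     * IOException gets a delay that doubles per consecutive failure,
     * capped at MAX_DELAY_MS.
     */
    static long delayFor(IOException failure, int consecutiveFailures) {
        if (failure instanceof MissingSystemDirException) {
            return MISCONFIGURED_DELAY_MS;
        }
        // Clamp the shift so the multiplication cannot overflow.
        int doublings = Math.min(Math.max(consecutiveFailures - 1, 0), 16);
        return Math.min(BASE_DELAY_MS << doublings, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        IOException network = new IOException("connection reset");
        for (int n = 1; n <= 8; n++) {
            System.out.println("failure " + n + " -> sleep "
                + delayFor(network, n) + " ms");
        }
        System.out.println("null system dir -> sleep "
            + delayFor(new MissingSystemDirException("no system dir"), 1)
            + " ms");
    }
}
```

In a loop like the one in (2), the catch block would log the exception, 
sleep for delayFor(except, failures), and reset the failure count after the 
next successful call. Whether a fixed long delay or returning from 
offerService() is the better response to a null system directory is still 
an open question.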



> TaskTracker.offerService could handle IO and Remote Exceptions better
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-3987
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3987
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Steve Loughran
>
> The core offerService() loop has a try/catch wrapper that catches and 
> processes exceptions. Most cause offerService() to return, which then 
> triggers a sleep and restart in the main loop. But some exceptions are just 
> logged and ignored, which may be inappropriate

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
