[ http://issues.apache.org/jira/browse/HADOOP-134?page=all ]

Owen O'Malley reassigned HADOOP-134:
------------------------------------

    Assign To: Owen O'Malley

> JobTracker trapped in a loop if it fails to localize a task
> -----------------------------------------------------------
>
>          Key: HADOOP-134
>          URL: http://issues.apache.org/jira/browse/HADOOP-134
>      Project: Hadoop
>         Type: Bug

>   Components: mapred
>     Versions: 0.1.0
>     Reporter: Runping Qi
>     Assignee: Owen O'Malley
>  Attachments: task-startup-safety.patch
>
> The symptoms:
>     When I ran jobs on a big cluster, I noticed that some jobs got stuck. 
> Some map tasks never started. When I looked at the log of the task tracker 
> responsible for those tasks, I saw the following exception:
> 060413 160702 Lost connection to JobTracker [kry1040/72.30.116.100:50020].  Retrying...
> java.io.IOException: No valid local directories in property: mapred.local.dir
>         at org.apache.hadoop.conf.Configuration.getFile(Configuration.java:282)
>         at org.apache.hadoop.mapred.JobConf.getLocalFile(JobConf.java:127)
>         at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:391)
>         at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:383)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:270)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:336)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:756)
> The reason for the exception is that the directory hadoop/mapred/local has 
> the "wrong" owner, so the task tracker cannot access it.
> This left the task tracker stuck in the following loop:
>             while (running) {
>                 boolean staleState = false;
>                 try {
>                     // This while-loop attempts reconnects if we get network errors
>                     while (running && ! staleState) {
>                         try {
>                             if (offerService() == STALE_STATE) {
>                                 staleState = true;
>                             }
>                         } catch (Exception ex) {
>                             LOG.log(Level.INFO, "Lost connection to JobTracker [" + jobTrackAddr + "].  Retrying...", ex);
>                             try {
>                                 Thread.sleep(5000);
>                             } catch (InterruptedException ie) {
>                             }
>                         }
>                     }
>                 } finally {
>                     close();
>                 }
>                 LOG.info("Reinitializing local state");
>                 initialize();
>             }
> Issue 1:
>     Method offerService() must catch and handle any exception thrown by 
> the new TaskInProgress() call, and report back to the job tracker if it 
> cannot run the task. That way, the task can be assigned to another task 
> tracker.
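
A minimal sketch of the idea in Issue 1, not the actual Hadoop code or the
attached patch: a failure while setting up one task is caught, reported, and
skipped, so it never escapes into the reconnect loop. The names
TaskStartupSketch, TaskLauncher, FailureReporter, and startTasks are
hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these names come from the Hadoop source.
class TaskStartupSketch {

    /** Stand-in for whatever localizes and launches a task on the tracker. */
    interface TaskLauncher {
        void localizeAndStart(String taskId) throws IOException;
    }

    /** Stand-in for the channel that tells the job tracker a task failed. */
    interface FailureReporter {
        void reportFailed(String taskId, String reason);
    }

    /**
     * Tries to start each assigned task. A setup failure for one task is
     * reported and skipped instead of propagating to the caller.
     * Returns the ids of the tasks that did start.
     */
    static List<String> startTasks(List<String> assigned,
                                   TaskLauncher launcher,
                                   FailureReporter reporter) {
        List<String> started = new ArrayList<>();
        for (String taskId : assigned) {
            try {
                launcher.localizeAndStart(taskId);
                started.add(taskId);
            } catch (IOException e) {
                // Report the failure so the task can be rescheduled elsewhere.
                reporter.reportFailed(taskId, e.getMessage());
            }
        }
        return started;
    }
}
```

With this shape, an exception such as "No valid local directories" turns into
a per-task failure report rather than being logged as a lost connection.
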
> Issue 2:
>     The task tracker should check that it can access the local dir at 
> initialization time, before accepting any tasks.
> Runping
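
The startup check proposed in Issue 2 could look roughly like the sketch
below. It assumes the local directories are given as a comma-separated list
and uses only java.io.File; the class LocalDirCheckSketch and the method
usableLocalDirs are hypothetical names, not part of Hadoop.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not taken from the Hadoop source.
class LocalDirCheckSketch {

    /**
     * Returns the entries of a comma-separated directory list that exist
     * (or can be created) and are both readable and writable.
     * Throws IOException if no entry qualifies, so that a tracker could
     * refuse to start instead of failing later on every task.
     */
    static List<File> usableLocalDirs(String commaSeparatedDirs) throws IOException {
        List<File> usable = new ArrayList<>();
        for (String name : commaSeparatedDirs.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            File dir = new File(trimmed);
            // mkdirs() is only attempted when the path is not already a directory.
            boolean present = dir.isDirectory() || dir.mkdirs();
            if (present && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            }
        }
        if (usable.isEmpty()) {
            throw new IOException("No usable local directories in: " + commaSeparatedDirs);
        }
        return usable;
    }
}
```

Calling such a check once during initialization would surface a wrongly owned
directory immediately, before any task is accepted.
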

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
