[ http://issues.apache.org/jira/browse/HADOOP-134?page=all ]
Owen O'Malley reassigned HADOOP-134:
------------------------------------
Assign To: Owen O'Malley
> JobTracker trapped in a loop if it fails to localize a task
> -----------------------------------------------------------
>
> Key: HADOOP-134
> URL: http://issues.apache.org/jira/browse/HADOOP-134
> Project: Hadoop
> Type: Bug
> Components: mapred
> Versions: 0.1.0
> Reporter: Runping Qi
> Assignee: Owen O'Malley
> Attachments: task-startup-safety.patch
>
> The symptoms:
> When I ran jobs on a big cluster, I noticed that some jobs got stuck.
> Some map tasks never started. When I looked at the log of the task tracker
> responsible for those tasks, I saw the following exception:
> 060413 160702 Lost connection to JobTracker [kry1040/72.30.116.100:50020]. Retrying...
> java.io.IOException: No valid local directories in property: mapred.local.dir
>     at org.apache.hadoop.conf.Configuration.getFile(Configuration.java:282)
>     at org.apache.hadoop.mapred.JobConf.getLocalFile(JobConf.java:127)
>     at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:391)
>     at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:383)
>     at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:270)
>     at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:336)
>     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:756)
> The exception occurs because the directory hadoop/mapred/local has the
> "wrong" owner, so the task tracker cannot access it.
> As a result, the task tracker got stuck in the following loop, where every
> exception from offerService() is treated as a lost connection and retried:
>     while (running) {
>         boolean staleState = false;
>         try {
>             // This while-loop attempts reconnects if we get network errors
>             while (running && ! staleState) {
>                 try {
>                     if (offerService() == STALE_STATE) {
>                         staleState = true;
>                     }
>                 } catch (Exception ex) {
>                     LOG.log(Level.INFO, "Lost connection to JobTracker [" + jobTrackAddr + "]. Retrying...", ex);
>                     try {
>                         Thread.sleep(5000);
>                     } catch (InterruptedException ie) {
>                     }
>                 }
>             }
>         } finally {
>             close();
>         }
>         LOG.info("Reinitializing local state");
>         initialize();
>     }
> Issue 1:
> The offerService() method must catch and handle the exceptions that may be
> thrown from the new TaskInProgress() call, and report back to the job tracker
> if it cannot run the task. This way, the task can be assigned to another task
> tracker.
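
One way to read the proposal in Issue 1 is sketched below. This is only an illustration, not the attached patch: TaskStartupSketch, TaskLocalizer, FailureReporter and startTasks are hypothetical names, standing in for the work done by new TaskInProgress() and for whatever call the task tracker uses to report a failed task to the job tracker. The point is that a localization failure is caught per task and reported, so it never reaches the reconnect loop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Hadoop source. */
class TaskStartupSketch {

    /** Stand-in for reporting a failed task back to the job tracker. */
    interface FailureReporter {
        void reportFailed(String taskId, String reason);
    }

    /** Stand-in for the localization done when a TaskInProgress is created. */
    interface TaskLocalizer {
        void localize(String taskId) throws IOException;
    }

    /**
     * Tries to start each offered task. A task whose localization throws is
     * reported as failed rather than letting the exception escape to the
     * caller. Returns the ids of the tasks that were started.
     */
    static List<String> startTasks(List<String> offered,
                                   TaskLocalizer localizer,
                                   FailureReporter reporter) {
        List<String> started = new ArrayList<>();
        for (String taskId : offered) {
            try {
                localizer.localize(taskId);
                started.add(taskId);
            } catch (IOException e) {
                // Local problem, not a lost connection: report it so the
                // task can be scheduled on another tracker.
                reporter.reportFailed(taskId, e.getMessage());
            }
        }
        return started;
    }
}
```

With this shape, one task that cannot be localized does not prevent the other offered tasks from starting, and the outer loop only sees exceptions that come from talking to the job tracker.
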
> Issue 2:
> The task tracker should check whether it can access the local dir at
> initialization time, before taking any tasks.
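
The startup check in Issue 2 could look roughly like the following. It is a minimal sketch under stated assumptions: LocalDirCheckSketch and usableLocalDirs are hypothetical names, the value of mapred.local.dir is assumed to be a comma-separated list of paths, and "accessible" is taken to mean that the directory exists (or can be created) and is readable and writable by the current user. It uses only java.io.File and does not touch the Hadoop Configuration API.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of the Hadoop source or the attached patch. */
class LocalDirCheckSketch {

    /**
     * Returns the directories from a comma-separated list that exist (or can
     * be created) and are readable and writable by the current user.
     * Throws IOException if no directory in the list is usable.
     */
    static List<File> usableLocalDirs(String commaSeparatedDirs) throws IOException {
        List<File> usable = new ArrayList<>();
        for (String name : commaSeparatedDirs.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            File dir = new File(trimmed);
            // A wrong owner typically shows up here as canRead()/canWrite()
            // returning false, or as mkdirs() failing.
            if ((dir.isDirectory() || dir.mkdirs()) && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            }
        }
        if (usable.isEmpty()) {
            throw new IOException("No usable local directories in: " + commaSeparatedDirs);
        }
        return usable;
    }
}
```

Calling such a check once during initialization, before the first offerService() call, would turn the ownership problem into an immediate startup failure instead of a task that silently never starts.
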
> Runping