[ http://issues.apache.org/jira/browse/HADOOP-737?page=all ]
Arun C Murthy reassigned HADOOP-737:
------------------------------------
Assignee: Arun C Murthy (was: Sanjay Dahiya)
> TaskTracker's job cleanup loop should check for finished job before deleting
> local directories
> ----------------------------------------------------------------------------------------------
>
> Key: HADOOP-737
> URL: http://issues.apache.org/jira/browse/HADOOP-737
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Reporter: Sanjay Dahiya
> Assigned To: Arun C Murthy
> Priority: Critical
> Fix For: 0.10.0
>
>
> TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks that
> should be closed. This mechanism doesn't indicate whether the job is really
> finished or the task is being killed for some other reason (e.g. a
> speculative instance succeeded). Since the TaskTracker doesn't know this, it
> assumes the job is finished and deletes the local job dir, causing any
> subsequent tasks for the same job on the same TaskTracker to fail with a
> "job.xml not found" exception, as reported in HADOOP-546 and possibly in
> HADOOP-543. This causes my patch for HADOOP-76 to fail for a large number of
> reduce tasks in some cases.
>
> The same problem causes extra exceptions in the logs while a job is being
> killed: the first task that gets closed deletes the local directories, and
> any other tasks that are about to be launched throw this exception. In this
> case it is less significant, as the job is being killed anyway and the only
> effect is extra exceptions in the logs.
> Possible solutions:
> 1. Add an extra method to InterTrackerProtocol for checking the job status
> before deleting the local directory.
> 2. Set TaskTracker.RunningJob.localized to false once the local directory is
> deleted, so that new tasks don't look for it there.
> There is clearly a race condition in this, and the logs may still get the
> exception during shutdown, but in normal cases it would work.
> Comments?
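
A minimal sketch of how the two proposed solutions could fit together, offered only as an illustration. Everything here is a simplified, hypothetical model rather than the actual TaskTracker code: the class JobCleanupSketch, the JobStatusSource interface and its isJobFinished() method (standing in for the extra protocol method of option 1), and the addJob() and onTaskClosed() methods are all assumed names. Only the idea of a "localized" flag on a RunningJob comes from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the cleanup logic; not the real TaskTracker.
public class JobCleanupSketch {

    /** Stand-in for TaskTracker.RunningJob, reduced to what the sketch needs. */
    static class RunningJob {
        final String jobId;
        boolean localized = true; // true while the local job dir (job.xml etc.) exists

        RunningJob(String jobId) {
            this.jobId = jobId;
        }
    }

    /** Assumed status query, standing in for the extra protocol method (option 1). */
    interface JobStatusSource {
        boolean isJobFinished(String jobId);
    }

    final Map<String, RunningJob> runningJobs = new HashMap<>();
    final JobStatusSource status;

    JobCleanupSketch(JobStatusSource status) {
        this.status = status;
    }

    /** Registers a job whose files have been localized on this tracker. */
    synchronized RunningJob addJob(String jobId) {
        return runningJobs.computeIfAbsent(jobId, RunningJob::new);
    }

    /**
     * Called when a task of the given job is closed.
     * Returns true only if the local job directory was (notionally) deleted.
     */
    synchronized boolean onTaskClosed(String jobId) {
        RunningJob rjob = runningJobs.get(jobId);
        if (rjob == null) {
            return false; // unknown job, nothing to clean up
        }
        // Option 1: a closed task does not imply a finished job (for example,
        // a speculative instance may have succeeded), so check before deleting.
        if (!status.isJobFinished(jobId)) {
            return false;
        }
        // The real code would delete the local job directory here.
        // Option 2: record that the files are gone so that later tasks
        // re-localize instead of looking for a missing job.xml.
        rjob.localized = false;
        runningJobs.remove(jobId);
        return true;
    }

    public static void main(String[] args) {
        Set<String> finished = new HashSet<>();
        JobCleanupSketch tracker = new JobCleanupSketch(finished::contains);
        RunningJob job = tracker.addJob("job_0001");

        // A task is killed while the job is still running: directory is kept.
        System.out.println(tracker.onTaskClosed("job_0001") + " " + job.localized);

        // The job finishes: the next close triggers cleanup and resets the flag.
        finished.add("job_0001");
        System.out.println(tracker.onTaskClosed("job_0001") + " " + job.localized);
    }
}
```

In this sketch the status check and the flag update happen under one lock, which narrows the race mentioned above but does not remove it: a real status query would be a remote call, and a task could still be launched between the query and the deletion.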