[ http://issues.apache.org/jira/browse/HADOOP-737?page=all ]

Arun C Murthy reassigned HADOOP-737:
------------------------------------

    Assignee: Arun C Murthy  (was: Sanjay Dahiya)

> TaskTracker's job cleanup loop should check for finished job before deleting 
> local directories
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-737
>                 URL: http://issues.apache.org/jira/browse/HADOOP-737
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Sanjay Dahiya
>         Assigned To: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.10.0
>
>
> TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks that
> should be closed. This mechanism doesn't pass along whether the job is
> really finished or the task is being killed for some other reason (for
> example, a speculative instance succeeded). Since TaskTracker doesn't know
> this state, it assumes the job is finished and deletes the local job dir,
> causing any subsequent tasks for the same job on the same TaskTracker to
> fail with a "job.xml not found" exception, as reported in HADOOP-546 and
> possibly in HADOOP-543. This causes my patch for HADOOP-76 to fail for a
> large number of reduce tasks in some cases.
>  
> The same problem causes extra exceptions in the logs while a job is being
> killed: the first task that gets closed deletes the local directories, and
> any other tasks that are about to be launched throw this exception. This
> case is less significant, as the job is being killed anyway and the only
> effect is extra exceptions in the logs.
> Possible solutions:
> 1. Add an extra method to InterTrackerProtocol to check the job status
> before deleting the local directory.
> 2. Set TaskTracker.RunningJob.localized to false once the local directory is 
> deleted so that new tasks don't look for it there. 
> There is clearly still a race condition in this, and the logs may still get
> the exception during shutdown, but in normal cases it would work.
> Comments?
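
Editor's sketch (not part of the original issue or of any Hadoop patch): the
two proposals above can be illustrated with a small, self-contained Java model.
Everything here is an assumption made for illustration: the class
JobCleanupSketch, the JobState enum, and the launchTask/closeTask methods are
hypothetical names, and the local job directory is modeled as a boolean. Only
the idea of a per-job "localized" flag comes from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the cleanup decision; not actual TaskTracker code. */
public class JobCleanupSketch {

    /** Hypothetical job states; the issue only distinguishes finished from not finished. */
    public enum JobState { RUNNING, FINISHED }

    /** Simplified stand-in for TaskTracker.RunningJob. */
    static final class RunningJob {
        boolean localized;   // true while the local job dir (job.xml etc.) exists
        int localizeCount;   // number of times the job has been localized
    }

    private final Map<String, RunningJob> jobs = new HashMap<>();

    /** Before launching a task, localize the job if its local dir is missing. */
    public synchronized void launchTask(String jobId) {
        RunningJob job = jobs.computeIfAbsent(jobId, id -> new RunningJob());
        if (!job.localized) {
            job.localized = true;   // stands in for copying job.xml to local disk
            job.localizeCount++;
        }
    }

    /**
     * Called when a task is closed. Returns true only if the local job
     * directory was (notionally) deleted.
     */
    public synchronized boolean closeTask(String jobId, JobState state) {
        RunningJob job = jobs.get(jobId);
        if (job == null || state != JobState.FINISHED) {
            // Proposal 1: the job is still running (e.g. only a speculative
            // duplicate was killed), so keep the local directory.
            return false;
        }
        // Proposal 2: mark the job as not localized so that a later task
        // re-localizes instead of looking for a deleted job.xml.
        job.localized = false;
        return true;
    }

    public synchronized boolean isLocalized(String jobId) {
        RunningJob job = jobs.get(jobId);
        return job != null && job.localized;
    }

    public synchronized int localizeCount(String jobId) {
        RunningJob job = jobs.get(jobId);
        return job == null ? 0 : job.localizeCount;
    }

    public static void main(String[] args) {
        JobCleanupSketch tracker = new JobCleanupSketch();
        tracker.launchTask("job_0001");
        System.out.println("deleted on task kill: "
            + tracker.closeTask("job_0001", JobState.RUNNING));
        System.out.println("deleted on job finish: "
            + tracker.closeTask("job_0001", JobState.FINISHED));
    }
}
```

Both methods are synchronized so that, within this model, a launch cannot
observe a half-finished cleanup. In a real deployment the job state would come
from the JobTracker, so the race the issue mentions (a state change between the
status check and the delete) would still need separate handling.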

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
