[ https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413864#comment-13413864 ]

Robert Joseph Evans commented on MAPREDUCE-4428:
------------------------------------------------

OK, I think I have an idea, but Sid, I would like your opinion on this. If you 
want to pull in Arun on this too, I am happy to have his opinion as well.

What if we augment the ContainerLaunchContext with something like a 
cleanup-on-kill boolean and a cleanup-on-bad-exit boolean? If cleanup-on-kill 
is set and the container is forcibly killed, or if cleanup-on-bad-exit is set 
and the container exits with a non-zero status, the NM would rerun the 
container with an environment variable set indicating that it is being rerun 
for cleanup. The NM would give it a configurable amount of time, say 20 
seconds, to do the cleanup, and if it has not exited by then the NM would 
shoot it.
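
To make that concrete, here is a rough Java sketch. None of this exists in 
YARN today: the setCleanupOnKill/setCleanupOnBadExit setters, the 
YARN_CONTAINER_CLEANUP_RERUN variable name, and relaunchForCleanup() are all 
made-up names, and wasKilled/exitCode/cleanupTimeoutMs stand in for state the 
NM already tracks.

{code}
// Hypothetical additions to ContainerLaunchContext (these setters do not exist yet):
ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
clc.setCleanupOnKill(true);      // rerun for cleanup if the container is forcibly killed
clc.setCleanupOnBadExit(true);   // rerun for cleanup if the container exits non-zero

// Hypothetical NM-side handling when a container finishes:
if ((wasKilled && clc.getCleanupOnKill())
    || (exitCode != 0 && clc.getCleanupOnBadExit())) {
  Map<String, String> env = new HashMap<String, String>(clc.getEnvironment());
  env.put("YARN_CONTAINER_CLEANUP_RERUN", "true");  // made-up variable name
  // Relaunch the same container once for cleanup; shoot it after the
  // configurable timeout (e.g. 20 seconds) if it has not exited on its own.
  relaunchForCleanup(clc, env, cleanupTimeoutMs);
}
{code}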

The RM would need a new flag when the AM is submitted to indicate that this 
should happen. If that flag is set, the RM would turn on cleanup-on-kill for 
the AM whenever it launches it, and it would also turn on cleanup-on-bad-exit 
when launching the AM for the last retry.
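
Again purely illustrative; the submission-side flag is made up, and 
attemptNumber/maxAppAttempts stand in for values the RM already tracks per 
application, layered on the real ApplicationSubmissionContext API.

{code}
// Hypothetical flag the client sets on the ApplicationSubmissionContext:
ApplicationSubmissionContext appCtx = Records.newRecord(ApplicationSubmissionContext.class);
appCtx.setCleanupOnFailure(true);   // made-up setter

// Hypothetical RM-side logic when launching an AM attempt:
ContainerLaunchContext amLaunchCtx = appCtx.getAMContainerSpec();
if (appCtx.getCleanupOnFailure()) {
  amLaunchCtx.setCleanupOnKill(true);                  // every attempt
  if (attemptNumber == maxAppAttempts) {
    amLaunchCtx.setCleanupOnBadExit(true);             // last retry only
  }
}
{code}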

The MR AM would have to be modified to look for the environment variable and 
only do cleanup when it sees it. The MR client would have to be modified to 
set the new submission flag.
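
And a hypothetical check on the MR AM side, matching the made-up variable 
name from the sketch above:

{code}
// Hypothetical: only run the cleanup/history-flush path when the NM signals
// that this launch is a cleanup rerun.
boolean cleanupRerun =
    Boolean.parseBoolean(System.getenv("YARN_CONTAINER_CLEANUP_RERUN"));
if (cleanupRerun) {
  // flush/copy job history, unregister from the RM, then exit quickly
  doCleanupAndExit();   // made-up helper
}
{code}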
                
> A failed job is not available under job history if the job is killed right 
> around the time the job is notified as failed 
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4428
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, jobtracker
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: am_failed_counter_limits.txt, appMaster_bad.txt, 
> appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently running the Hadoop CDH4 version 
> (based on the 2.0 alpha release):
> When our Hadoop client code gets a notification for a completed job (using a 
> RunningJob object job, with job.isComplete() && job.isSuccessful()==false), 
> the client code does an unconditional job.killJob() to terminate the job.
> With earlier Hadoop versions (verified on 0.20.2), we still have full access 
> to the job logs afterwards through the Hadoop console. However, when using 
> MapReduce V2, the failed job no longer shows up under the job history 
> server, and the tracking URL of the job still points to the non-existent 
> application master HTTP port.
> Once we removed the call to job.killJob() for failed jobs from our client 
> code, we were able to access the job in job history with MapReduce V2 as 
> well. This therefore appears to be a race condition in job management with 
> respect to job history for failed jobs.
> We have the application master and node manager logs collected for this 
> scenario if that will help isolate the problem and the fix.
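
For reference, a minimal sketch of the client pattern described above, using 
the old mapred API; the job configuration and polling interval are 
placeholders, and the class name is made up.

{code}
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class KillFailedJobClient {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();                    // job setup omitted
    RunningJob job = new JobClient(conf).submitJob(conf);

    while (!job.isComplete()) {                      // poll for completion
      Thread.sleep(5000);
    }

    if (!job.isSuccessful()) {
      // The unconditional kill that races with the AM writing job history.
      job.killJob();
    }
  }
}
{code}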


        
