[
https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413864#comment-13413864
]
Robert Joseph Evans commented on MAPREDUCE-4428:
------------------------------------------------
OK, I think I have an idea, but Sid, I would like your opinion on this. If you
want to pull Arun in on this too, I am happy to have his opinion as well.
What if we augment the ContainerLaunchContext to have something like a
cleanup-on-kill boolean and a cleanup-on-bad-exit boolean? If cleanup-on-kill
is set and the container is forcibly killed, or if cleanup-on-bad-exit is set
and the container exits with a non-zero status, the NM would rerun the
container, but with an environment variable set indicating that it is being
rerun for cleanup. The NM would give it a configurable amount of time, say 20
seconds, to do the cleanup, and if it has not exited by then the NM will shoot
it.
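To make that concrete, here is a very rough sketch of what the NM side could
look like. The flag accessors on ContainerLaunchContext, the environment
variable name, the config key, and the relaunch helper are all placeholder
names for the idea, not an actual patch.
{code:java}
// Hypothetical NM-side handling when a container is killed or exits.
// getCleanupOnKill()/getCleanupOnBadExit() are the proposed new booleans on
// ContainerLaunchContext; the env var, the config key, and
// relaunchWithDeadline() are made-up placeholders.
void onContainerFinished(Container container, int exitCode, boolean wasKilled) {
  ContainerLaunchContext ctx = container.getLaunchContext();
  boolean rerunForCleanup =
      (wasKilled && ctx.getCleanupOnKill())
      || (exitCode != 0 && ctx.getCleanupOnBadExit());
  if (rerunForCleanup) {
    // Tell the relaunched process that this run is only for cleanup.
    ctx.getEnvironment().put("CONTAINER_CLEANUP_RERUN", "true");
    // Configurable cleanup window, say 20 seconds by default.
    long cleanupMs = conf.getLong("yarn.nodemanager.cleanup-rerun.timeout-ms", 20000L);
    // Rerun the container and shoot it if it is still alive after the deadline.
    relaunchWithDeadline(container, cleanupMs);
  }
}
{code}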
The RM would need a new flag when the AM is submitted to indicate that this
should happen. If that flag is set, the RM would turn on cleanup-on-kill for
the AM whenever it is launched, and would also turn on cleanup-on-bad-exit
when it is launching the AM for the last retry.
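Roughly, the RM-side wiring could look something like the following; the
submission-level flag, the setters, and the last-retry check are again just
placeholder names.
{code:java}
// Hypothetical RM-side wiring when building the AM's ContainerLaunchContext.
// getCleanupOnAmFailure() on the submission context and the two setters are
// placeholder names for the proposed flags.
ContainerLaunchContext amContext = createAMLaunchContext(submissionContext);
if (submissionContext.getCleanupOnAmFailure()) {
  // Any forcible kill of the AM should trigger a cleanup rerun.
  amContext.setCleanupOnKill(true);
  if (appAttemptId.getAttemptId() == maxAppAttempts) {
    // Only the last retry also cleans up after a bad (non-zero) exit.
    amContext.setCleanupOnBadExit(true);
  }
}
{code}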
The MR AM would have to be modified to look for the environment variable and
only do cleanup if it sees it. The MR client would have to be modified to set
the new submission flag.
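On the MR side the changes might be as small as this; the env var and the
config property name are placeholders.
{code:java}
// Hypothetical MR AM startup check: if this is a cleanup rerun, only finish
// the job history files / unregister and exit, instead of a normal restart.
boolean cleanupRerun = "true".equals(System.getenv("CONTAINER_CLEANUP_RERUN"));
if (cleanupRerun) {
  doJobHistoryCleanupAndExit();  // placeholder: flush history, unregister, exit
}

// Hypothetical client-side switch to request the behavior at submission time.
jobConf.setBoolean("mapreduce.am.cleanup-on-failure", true);
{code}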
> A failed job is not available under job history if the job is killed right
> around the time job is notified as failed
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4428
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver, jobtracker
> Affects Versions: 2.0.0-alpha
> Reporter: Rahul Jain
> Attachments: am_failed_counter_limits.txt, appMaster_bad.txt,
> appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently running hadoop CDH4 version (based
> upon 2.0 alpha release):
> When our hadoop client code gets a notification for a completed job (using a
> RunningJob object job, with job.isComplete() && job.isSuccessful()==false),
> the hadoop client code does an unconditional job.killJob() to terminate the
> job.
> With earlier hadoop versions (verified on hadoop 0.20.2), we still have full
> access to the job logs afterwards through the hadoop console. However, when
> using MapReduce V2, the failed hadoop job no longer shows up under the job
> history server. Also, the tracking URL of the job still points to the
> non-existent application master HTTP port.
> Once we removed the call to job.killJob() for failed jobs from our hadoop
> client code, we were able to access the job in the job history with MapReduce
> V2 as well. Therefore this appears to be a race condition in the job
> management with respect to job history for failed jobs.
> We do have the application master and node manager logs collected for this
> scenario if that will help isolate the problem and the fix.
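For reference, the client pattern described in the quoted report boils down to
roughly the following (old org.apache.hadoop.mapred API; the poll interval and
method name are arbitrary):
{code:java}
import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Poll for completion, then unconditionally kill the job if it did not
// succeed. That unconditional killJob() is what races with the AM finishing
// and copying its job history files.
static void runAndKillIfFailed(JobClient jobClient, JobConf jobConf)
    throws IOException, InterruptedException {
  RunningJob job = jobClient.submitJob(jobConf);
  while (!job.isComplete()) {
    Thread.sleep(5000);        // arbitrary poll interval
  }
  if (!job.isSuccessful()) {
    job.killJob();             // removing this call made the history show up again
  }
}
{code}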