[
https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rahul Jain updated MAPREDUCE-4428:
----------------------------------
Attachment: resrcmgr_bad.txt
Attached are the resource manager logs for the failure case. Note that the
resource manager was not restarted at any time, and the same stack trace
appears in the resource manager log when the application attempts to unregister:
{code}
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:
Application doesn't exist in cache appattempt_1341894680756_0017_000001....
{code}
> A failed job is not available under job history if the job is killed right
> around the time job is notified as failed
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4428
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver, jobtracker
> Affects Versions: 2.0.0-alpha
> Reporter: Rahul Jain
> Attachments: appMaster_bad.txt, appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently when running the Hadoop CDH4 version
> (based on the 2.0 alpha release).
> When our Hadoop client code gets a notification for a completed job (using a
> RunningJob object job, with job.isComplete() && job.isSuccessful() == false),
> it unconditionally calls job.killJob() to terminate the job.
> With earlier Hadoop versions (verified on Hadoop 0.20.2), we still have full
> access to the job logs afterwards through the Hadoop console. However, when
> using MapReduce V2, the failed job no longer shows up under the job history
> server. Also, the tracking URL of the job still points to the HTTP port of
> the application master, which no longer exists.
> Once we removed the call to job.killJob() for failed jobs from our Hadoop
> client code, we were able to access the job in job history with MapReduce V2
> as well. This therefore appears to be a race condition between job management
> and job history for failed jobs.
> We have the application master and node manager logs collected for this
> scenario, in case they help isolate the problem and the fix.
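
The client-side workaround described in the report (skipping job.killJob() once the job is already complete) might look roughly like the sketch below. JobHandle and StubJob are hypothetical stand-ins written for this example, not Hadoop classes; they mirror only the isComplete() and killJob() calls named above. The real RunningJob methods declare IOException, so an adapter around RunningJob would need to handle that.

```java
/**
 * Sketch only: guards killJob() so it is never issued for a job that has
 * already completed (successfully or not). All types here are hypothetical.
 */
final class KillGuard {

    /** Hypothetical minimal view of a job; not a Hadoop interface. */
    interface JobHandle {
        boolean isComplete();
        void killJob();
    }

    /**
     * Issues a kill only while the job is still running.
     *
     * @return true if killJob() was called, false if the job had already completed
     */
    static boolean killIfStillRunning(JobHandle job) {
        if (job.isComplete()) {
            // Already finished (for example, failed): leave it alone so that
            // nothing interferes with whatever bookkeeping follows completion.
            return false;
        }
        job.killJob();
        return true;
    }

    /** In-memory stub used only to exercise the guard. */
    static final class StubJob implements JobHandle {
        private final boolean complete;
        int killCalls = 0;

        StubJob(boolean complete) {
            this.complete = complete;
        }

        @Override
        public boolean isComplete() {
            return complete;
        }

        @Override
        public void killJob() {
            killCalls++;
        }
    }

    public static void main(String[] args) {
        StubJob failed = new StubJob(true);
        StubJob running = new StubJob(false);
        System.out.println("kill issued for completed job: " + killIfStillRunning(failed));
        System.out.println("kill issued for running job: " + killIfStillRunning(running));
    }
}
```

This only avoids the trigger on the client side; it does not address the suspected race in the job history handling itself.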