[
https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412950#comment-13412950
]
Rahul Jain commented on MAPREDUCE-4428:
---------------------------------------
Robert,
To make the user perspective clear here:
This grid is a single-user managed grid; no other process was running at the
time, and no one else tried to kill the job here.
The sequence is:
a) The job creator application submitted the job to the hadoop grid.
b) The max retry count was set to 1 for both mappers and reducers, so as soon
as a task failed, the system (the AM?) decided to kill all other tasks.
c) The submitter application waits in a sleep loop, waking up every 1 second
to check the status of the job by calling JobClient.getJob().
d) When the failure in (b) occurs, the application sees the job reported as
complete and failed (isSuccessful()=false, isComplete()=true on the
RunningJob object).
e) The application issues a killJob() on the RunningJob object at this time
(see the sketch below).
f) As a result, nothing is accessible in job history from the hadoop console;
even the AM container logs cannot be accessed.
Removing (e) from the above sequence makes the logs accessible again. As I
mentioned, with older versions of map-reduce we never encountered the issue of
logs getting lost. I believe we need to handle the case of a user-initiated
'KILL' of the job better in MapReduceV2; 90% of the time we look at map-reduce
logs only for failed and killed jobs, so this functionality should work as
reliably as possible.
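
For reference, here is a minimal sketch of the polling/kill sequence in
(c)-(e). The JobWatcher class name and surrounding structure are illustrative
only, not our actual client code; the JobClient/RunningJob calls are the ones
named above:

{code}
// Minimal sketch of steps (c)-(e); class name and structure are illustrative.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobWatcher {
  public static void watch(JobConf conf, JobID jobId) throws Exception {
    JobClient client = new JobClient(conf);
    while (true) {
      RunningJob job = client.getJob(jobId);  // (c) poll the job status
      if (job.isComplete()) {
        if (!job.isSuccessful()) {
          job.killJob();  // (e) the kill that makes history/logs unreachable
        }
        break;
      }
      Thread.sleep(1000);  // (c) wake up every 1 second
    }
  }
}
{code}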
> A failed job is not available under job history if the job is killed right
> around the time the job is notified as failed
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4428
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver, jobtracker
> Affects Versions: 2.0.0-alpha
> Reporter: Rahul Jain
> Attachments: appMaster_bad.txt, appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently running the hadoop CDH4 version
> (based upon the 2.0 alpha release):
> If our hadoop client code gets a notification for a completed job (using a
> RunningJob object job, with job.isComplete() && job.isSuccessful()==false),
> the hadoop client code does an unconditional job.killJob() to terminate the
> job.
> With earlier hadoop versions (verified on hadoop 0.20.2), we still have full
> access to the job logs afterwards through the hadoop console. However, when
> using MapReduceV2, the failed hadoop job no longer shows up under the
> jobhistory server. Also, the tracking URL of the job still points to the
> non-existent ApplicationMaster http port.
> Once we removed the call to job.killJob() for failed jobs from our hadoop
> client code, we were able to access the job in job history with MapReduceV2
> as well. Therefore this appears to be a race condition in the job management
> with respect to job history for failed jobs.
> We do have the application master and node manager logs collected for this
> scenario if that will help isolate the problem and the fix better.