[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412950#comment-13412950
 ] 

Rahul Jain commented on MAPREDUCE-4428:
---------------------------------------

Robert,

To make the user perspective clear here:

This grid is a single-user managed grid; no other process was running at the 
time, and no one else attempted to kill the job.

The sequence is:

a) The job creator application submitted the job to the hadoop grid.

b) The max retry count was set to 1 for both mappers and reducers, so as soon 
as a task failed, the system (the AM?) decided to kill all other tasks.

c) The submitter application waits in a sleep loop, waking up every second to 
check the job status via JobClient.getJob().

d) When the above condition occurs, the application sees the running job 
reported as complete and failed (isSuccessful()==false, isComplete()==true on 
the RunningJob object).

e) The application then issues killJob() on the RunningJob object.

f) As a result, nothing is accessible in job history from the hadoop console; 
even the AM container logs cannot be accessed.

Removing step (e) from the above sequence makes the logs accessible again. As 
I mentioned, with older versions of map-reduce we never encountered the issue 
of logs getting lost. I believe we need to handle the case of a user-initiated 
'KILL' of the job better in MapReduceV2; 90% of the time we look at map-reduce 
logs only for failed and killed jobs, so this functionality should work as 
reliably as possible.
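For reference, the client-side sequence in steps (c)-(e) boils down to the 
loop sketched below. This is a self-contained illustration, not the actual 
submitter code: a tiny mock interface stands in for Hadoop's real 
org.apache.hadoop.mapred.RunningJob so the logic can run in isolation, and 
all class and method names here are stand-ins except the three RunningJob 
methods named in the steps above.

```java
// Sketch of the poll-then-kill sequence (steps c-e) that triggers the bug.
// RunningJob below is a mock with the same three methods the steps reference;
// real client code would obtain one from JobClient.getJob().
public class KillAfterFailSketch {

    // Minimal stand-in for Hadoop's RunningJob interface (mock, not the real API).
    interface RunningJob {
        boolean isComplete();
        boolean isSuccessful();
        void killJob();
    }

    // Step (c): poll once per second until the job completes.
    // Step (d): detect the complete-but-failed state.
    // Step (e): issue the unconditional killJob() that loses the history entry.
    // Returns true if killJob() was issued.
    static boolean waitAndKillOnFailure(RunningJob job) {
        while (!job.isComplete()) {
            try {
                Thread.sleep(1000L); // 1-second sleep loop from step (c)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        if (!job.isSuccessful()) { // step (d): complete, but failed
            job.killJob();         // step (e): the problematic kill
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Mock job that reports complete + failed immediately.
        final boolean[] killed = {false};
        RunningJob failedJob = new RunningJob() {
            public boolean isComplete()   { return true; }
            public boolean isSuccessful() { return false; }
            public void killJob()         { killed[0] = true; }
        };
        waitAndKillOnFailure(failedJob);
        System.out.println("killJob() issued: " + killed[0]);
    }
}
```

Dropping the killJob() branch (i.e., skipping step (e) for jobs that are 
already complete) is the workaround that keeps the job visible in job history.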

> A failed job is not available under job history if the job is killed right 
> around the time job is notified as failed 
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4428
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, jobtracker
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: appMaster_bad.txt, appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently running hadoop CDH4 version (based 
> upon 2.0 alpha release):
> In case our hadoop client code gets a notification for a completed job 
> (using the RunningJob object, with job.isComplete() && 
> job.isSuccessful()==false), the hadoop client code does an unconditional 
> job.killJob() to terminate the job.
> With earlier hadoop versions (verified on hadoop 0.20.2), we still have 
> full access to job logs afterwards through the hadoop console. However, when 
> using MapReduceV2, the failed hadoop job no longer shows up under the 
> jobhistory server. Also, the tracking URL of the job still points to the 
> non-existent Application Master http port.
> Once we removed the call to job.killJob() for failed jobs from our hadoop 
> client code, we were able to access the job in job history with mapreduce V2 
> as well. Therefore this appears to be a race condition in the job management 
> with respect to job history for failed jobs.
> We do have the application master and node manager logs collected for this 
> scenario if that will help isolate the problem and the fix.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira