[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412773#comment-13412773
 ] 

Robert Joseph Evans commented on MAPREDUCE-4428:
------------------------------------------------

It looks like someone killed your application:

{noformat}
2012-07-11 03:04:28,481 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   
IP=10.202.50.180        OPERATION=Kill Application Request      
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1341894680756_0017
2012-07-11 03:04:28,481 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1341894680756_0017 State change from RUNNING to KILLED
{noformat}

This caused the RM to forget about the application, and it happened just as 
your application was about to fail. The AM asked to unregister, but the RM 
effectively said "I don't know who you are", when in reality it should have 
said "didn't I just try to kill you?". I don't know who tried to kill this 
application, or why the kill went to the RM instead of the AM. The issue here 
is that normally, for a mapreduce job -kill, the client first asks the AM to 
shut itself down. That way the AM can put the job history logs where they are 
supposed to be before the client asks the RM to kill the application. If you 
do a yarn application -kill, there is no guarantee what the AM will or will 
not be able to do before it is killed. If the AM had been any slower, the 
NodeManager would simply have sent a kill -9 to the AM, and it would have had 
no chance to put the logs in the correct place. You should probably look at 
who was on 10.202.50.180 and what they were doing that might have asked the 
RM to kill this AM.
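
A rough sketch of the two kill paths described above, assuming the Hadoop 2.x 
client libraries as they appear in later 2.x releases (the Configuration, 
JobID, and ApplicationId values are placeholders obtained elsewhere, not 
anything from this report):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class KillPaths {

  // Path 1: the MapReduce client path. killJob() goes through the AM first,
  // so the AM gets a chance to copy the job history files to their final
  // location before the application is torn down.
  static void killViaMapReduceClient(Configuration conf, JobID jobId)
      throws Exception {
    Cluster cluster = new Cluster(conf);
    Job job = cluster.getJob(jobId);
    if (job != null) {
      job.killJob();
    }
  }

  // Path 2: the YARN client path. The request goes straight to the RM's
  // ClientRMService (the audit log line above), with no coordination with
  // the AM, so the AM may be killed before it can move the logs.
  static void killViaResourceManager(Configuration conf, ApplicationId appId)
      throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      yarnClient.killApplication(appId);
    } finally {
      yarnClient.stop();
    }
  }
}
{noformat}

The only difference that matters here is who gets asked first: in the second 
path nothing protects the history copy if the AM is slow to shut down.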

Fixing this in the general case, so that the job history logs are always 
copied to the correct place, is going to be difficult. We would have to 
insert something that always runs after the AM has exited; it is probably 
best to make it run only when the AM has exited badly, even for a kill. It is 
possible, just not a simple fix. It is even more difficult if we want to 
handle the case where the node appears to go down just as the AM is crashing. 
There are lots of corner cases that potentially make this very difficult to 
get right.
                
> A failed job is not available under job history if the job is killed right 
> around the time job is notified as failed 
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4428
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, jobtracker
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: appMaster_bad.txt, appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently when running Hadoop CDH4 (based on 
> the 2.0 alpha release):
> When our Hadoop client code gets a notification for a completed job (using 
> a RunningJob object "job", with job.isComplete() && 
> job.isSuccessful()==false), it does an unconditional job.killJob() to 
> terminate the job.
> With earlier Hadoop versions (verified on 0.20.2), we still have full 
> access to the job logs afterwards through the Hadoop console. However, when 
> using MapReduce v2, the failed job no longer shows up under the job history 
> server. Also, the tracking URL of the job still points to the now 
> non-existent ApplicationMaster HTTP port.
> Once we removed the call to job.killJob() for failed jobs from our client 
> code, we were able to access the job in job history with MapReduce v2 as 
> well. Therefore this appears to be a race condition in job management with 
> respect to job history for failed jobs.
> We have the ApplicationMaster and NodeManager logs collected for this 
> scenario if that will help isolate the problem and the fix.
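
A minimal sketch of the client-side pattern the report describes, assuming 
the old org.apache.hadoop.mapred API and a RunningJob handle the client 
already holds; the class and method names below are illustrative, not the 
reporter's actual code:

{noformat}
import java.io.IOException;
import org.apache.hadoop.mapred.RunningJob;

public class ClientCompletionHandler {

  // Reported behavior: once polling shows the job completed but did not
  // succeed, the client unconditionally kills it. On MRv2 this kill races
  // with the AM's own shutdown and the job history copy, so the failed job
  // can disappear from the JobHistoryServer.
  static void handleCompletion(RunningJob job) throws IOException {
    if (job.isComplete() && !job.isSuccessful()) {
      job.killJob();
    }
  }

  // Reported workaround: skip killJob() for a job that has already completed
  // as failed; the history files then show up where expected.
  static void handleCompletionWithoutKill(RunningJob job) throws IOException {
    if (job.isComplete() && !job.isSuccessful()) {
      // no killJob() here; just record the failure and move on
    }
  }
}
{noformat}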
