[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751973#comment-13751973
 ] 

Jian He commented on MAPREDUCE-5441:
------------------------------------

Thanks [~rohithsharma] for reporting this problem.

Earlier, this problem was not easy to reproduce on my side because at that time 
MR chose to ignore the Invalid AMRMToken exception after the RM restarted: it never 
explicitly sent the JOB_AM_REBOOT event and stayed alive until it was killed by a 
signal from the NM. After that, JobClient could quickly switch to the new AM.

Now MR has been changed to explicitly send the JOB_AM_REBOOT event on an Invalid 
AMRMToken exception (this should be fixed later), so JobClient is more likely to 
see the ERROR state of the job, which causes JobClient to exit prematurely.
I reproduced this problem by adding a long sleep in MRAppMaster.shutDownJob() for 
the normal shutdown path and in MRAppMasterShutdownHook for the signal-triggered 
shutdown, so that JobClient is very likely to see the ERROR state.

I uploaded a patch that, when the job is in the REBOOT state, returns the external 
state as RUNNING to prevent JobClient from exiting prematurely.
The manual test above passed with the patch and failed without it.
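
For readers who do not have the patch open, the idea can be sketched roughly as 
below. This is only a simplified illustration, not the code in the attached 
patch: the class RebootStateSketch, both enums, and the methods externalState() 
and clientWouldExit() are hypothetical names made up for this example, and the 
state lists are an illustrative subset rather than the real MR state machine.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical, simplified model of the state mapping; not the actual MR classes. */
final class RebootStateSketch {

    /** Internal AM-side job states (illustrative subset, assumed names). */
    enum InternalState { NEW, RUNNING, SUCCEEDED, FAILED, KILLED, ERROR, REBOOT }

    /** States reported to the client (illustrative subset, assumed names). */
    enum ExternalState { NEW, RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

    /** External states after which a polling client would stop waiting. */
    private static final Set<ExternalState> TERMINAL = EnumSet.of(
            ExternalState.SUCCEEDED, ExternalState.FAILED,
            ExternalState.KILLED, ExternalState.ERROR);

    /**
     * Maps an internal state to the state shown to the client.
     * REBOOT is reported as RUNNING so the client keeps polling and can
     * pick up the next AM attempt instead of seeing a terminal state.
     */
    static ExternalState externalState(InternalState internal) {
        if (internal == InternalState.REBOOT) {
            return ExternalState.RUNNING;
        }
        // All other states in this sketch share their name with the external state.
        return ExternalState.valueOf(internal.name());
    }

    /** True if a client polling the job would treat this state as final. */
    static boolean clientWouldExit(ExternalState external) {
        return TERMINAL.contains(external);
    }

    public static void main(String[] args) {
        ExternalState reported = externalState(InternalState.REBOOT);
        // Prints: RUNNING -> client exits: false
        System.out.println(reported + " -> client exits: " + clientWouldExit(reported));
    }
}
```

In this sketch, mapping REBOOT to ERROR instead would make clientWouldExit() 
return true, which corresponds to the premature JobClient exit described above.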
                
> JobClient exit whenever RM issue Reboot command to 1st attempt App Master.
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5441
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5441
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, client
>    Affects Versions: 2.1.0-beta, 2.0.5-alpha, 2.1.1-beta
>            Reporter: Rohith Sharma K S
>            Assignee: Jian He
>         Attachments: MAPREDUCE-5441.patch
>
>
> When the RM issues the Reboot command to the app master, the app master shuts 
> down gracefully. All the history events are written to HDFS with the job 
> status set to ERROR. JobClient gets the job state as ERROR and exits. 
> But the RM launches a 2nd-attempt app master, and no client is left to get the 
> job status. In the RM UI, the job status is displayed as SUCCESS, but for the 
> client the job has failed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
