[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786415#comment-13786415
 ] 

Jian He commented on MAPREDUCE-5562:
------------------------------------

bq. If the error is a read timeout or connection refused then I'm not sure we 
want the AM to fall over immediately in those cases.
Since we are using RMProxy, connection exceptions are handled in RMProxy and 
retried automatically, and we can also define other types of exceptions in 
RMProxy with different retry policies if needed. For work-preserving restart, 
the AM will hang while the RM is down, and after the RM comes back up, it 
should be able to unregister successfully.
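
To illustrate the idea of a different retry policy per exception type, here is 
a minimal plain-Java sketch. It is not RMProxy's actual code: the class 
RetryByExceptionSketch, the Policy type, and the retry counts are all 
hypothetical and only show the shape of the mechanism.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class RetryByExceptionSketch {
    /** Hypothetical policy: how many retries are allowed and the pause between them. */
    static final class Policy {
        final int maxRetries;
        final long sleepMillis;

        Policy(int maxRetries, long sleepMillis) {
            this.maxRetries = maxRetries;
            this.sleepMillis = sleepMillis;
        }
    }

    // Insertion-ordered, so more specific exception types should be registered first.
    private final Map<Class<? extends Exception>, Policy> policies = new LinkedHashMap<>();
    private final Policy defaultPolicy;

    RetryByExceptionSketch(Policy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    RetryByExceptionSketch on(Class<? extends Exception> type, Policy policy) {
        policies.put(type, policy);
        return this;
    }

    /** Returns the first registered policy whose type matches, else the default. */
    Policy policyFor(Exception e) {
        for (Map.Entry<Class<? extends Exception>, Policy> entry : policies.entrySet()) {
            if (entry.getKey().isInstance(e)) {
                return entry.getValue();
            }
        }
        return defaultPolicy;
    }

    /** Runs the action, retrying according to the policy of each exception thrown. */
    <T> T call(Callable<T> action) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                Policy policy = policyFor(e);
                if (failures >= policy.maxRetries) {
                    throw e; // retries exhausted for this kind of failure
                }
                failures++;
                Thread.sleep(policy.sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        RetryByExceptionSketch retry =
            new RetryByExceptionSketch(new Policy(0, 0))    // unknown errors: fail fast
                .on(ConnectException.class, new Policy(5, 10)) // RM unreachable: keep trying
                .on(IOException.class, new Policy(1, 10));     // other I/O errors: one retry

        int[] calls = {0};
        String result = retry.call(() -> {
            // Simulate the RM being down for the first two attempts.
            if (calls[0]++ < 2) {
                throw new ConnectException("RM is down");
            }
            return "unregistered";
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

In this sketch a connection failure is retried several times, while an 
exception with no registered policy propagates immediately.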

bq. What if this is the last AM attempt? Do we really want to orphan the 
staging directory and fail to generate job history in those cases?
Even if this is the last retry, when the AM crashes in the usual way before 
unregister, the staging directory is orphaned as well. 
If the AM fails inside unregister, as Zhijie mentioned, the job history files 
should already be flushed and moved to the intermediate_done dir, but we do 
have an orphaned staging dir.

IMO, making the AM behave like a normal AM failure when unregister fails is 
consistent with the RM side, where an application that fails to unregister is 
deemed failed.
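
A rough sketch of that behavior, under stated assumptions: this is not the 
patch itself, and RmClient, shutdown, and the exit codes are hypothetical 
names used only for illustration. The point is that a failed unregister skips 
the staging-dir cleanup and ends the attempt with a failure status.

```java
public class UnregisterExitSketch {
    /** Hypothetical stand-in for the AM's client to the RM. */
    interface RmClient {
        void unregister() throws Exception;
    }

    /**
     * Returns the exit status of the attempt: 0 if unregister succeeded and the
     * staging directory was cleaned up, 1 if unregister threw.
     */
    static int shutdown(RmClient rm, Runnable cleanUpStagingDir) {
        try {
            rm.unregister();
        } catch (Exception e) {
            // Treat this like any other AM failure: no cleanup, non-zero exit.
            System.err.println("unregister failed: " + e.getMessage());
            return 1;
        }
        cleanUpStagingDir.run();
        return 0;
    }

    public static void main(String[] args) {
        boolean[] cleaned = {false};

        int ok = shutdown(() -> { }, () -> cleaned[0] = true);
        System.out.println("status=" + ok + " cleaned=" + cleaned[0]);

        cleaned[0] = false;
        int failed = shutdown(
            () -> { throw new java.io.IOException("RM rejected unregister"); },
            () -> cleaned[0] = true);
        System.out.println("status=" + failed + " cleaned=" + cleaned[0]);
    }
}
```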


> MR AM should exit when unregister() throws exception
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-5562
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5562
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: MAPREDUCE-5562.1.patch, MAPREDUCE-5562.2.patch, 
> MAPREDUCE-5562.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)
