[ https://issues.apache.org/jira/browse/MAPREDUCE-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786415#comment-13786415 ]
Jian He commented on MAPREDUCE-5562:
------------------------------------

bq. If the error is a read timeout or connection refused then I'm not sure we want the AM to fall over immediately in those cases.

Since we are using RMProxy, connection exceptions are handled in RMProxy and retried automatically, and we can also define other types of exceptions in RMProxy with different retry policies if needed. For work-preserving restart, the AM will hang while the RM is down, and after the RM comes back up it should be able to unregister successfully.

bq. What if this is the last AM attempt? Do we really want to orphan the staging directory and fail to generate job history in those cases?

If this is the last retry and the AM simply crashes before unregister, the staging directory is orphaned as well. If the AM fails inside unregister, as Zhijie mentioned, the job history files should already be flushed and moved to the intermediate_done dir, but we do have an orphaned staging dir. IMO, making the AM behave like a normal AM failure when unregister fails is consistent with the RM side, where an application that fails to unregister is deemed a failure.

> MR AM should exit when unregister() throws exception
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-5562
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5562
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: MAPREDUCE-5562.1.patch, MAPREDUCE-5562.2.patch, MAPREDUCE-5562.3.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.1#6144)
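
To illustrate the behaviour discussed in the comment above, here is a minimal, self-contained sketch. It is not the code from the attached patches and it does not use the real RMProxy or YARN classes. `Unregistrar`, `unregisterWithRetry` and `finishAttempt` are hypothetical names, and the bounded retry count and fixed sleep are assumptions made only to keep the example runnable. The idea is that connection failures are retried, while any other failure of unregister makes the attempt end like an ordinary AM failure.

```java
import java.io.IOException;
import java.net.ConnectException;

public class UnregisterSketch {

    /** Hypothetical stand-in for the unregister call the AM makes to the RM. */
    interface Unregistrar {
        void unregister() throws IOException;
    }

    /**
     * Retries only connection failures, up to maxRetries extra attempts,
     * sleeping sleepMillis between attempts (assumed policy, not RMProxy's).
     * Any other IOException propagates to the caller immediately.
     */
    static void unregisterWithRetry(Unregistrar rm, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        int retries = 0;
        while (true) {
            try {
                rm.unregister();
                return;
            } catch (ConnectException e) {
                if (retries >= maxRetries) {
                    throw e; // retries exhausted: give up
                }
                retries++;
                Thread.sleep(sleepMillis);
            }
        }
    }

    /**
     * Returns the exit status for this attempt: 0 if unregister succeeded,
     * 1 if it failed, so the attempt is treated like a normal AM failure.
     */
    static int finishAttempt(Unregistrar rm, int maxRetries, long sleepMillis) {
        try {
            unregisterWithRetry(rm, maxRetries, sleepMillis);
            return 0;
        } catch (IOException e) {
            return 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 1;
        }
    }

    public static void main(String[] args) {
        // The RM is "down" for the first two calls, then comes back up.
        int[] calls = {0};
        Unregistrar flaky = () -> {
            if (calls[0]++ < 2) {
                throw new ConnectException("connection refused");
            }
        };
        System.out.println(finishAttempt(flaky, 5, 10L)); // prints 0

        // A non-connection failure is not retried; the attempt fails.
        Unregistrar broken = () -> {
            throw new IOException("unregister rejected");
        };
        System.out.println(finishAttempt(broken, 5, 10L)); // prints 1
    }
}
```

In the first case the call succeeds on the third try once the simulated RM is reachable again. In the second case the exception surfaces on the first call and the attempt is reported as failed.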