[ https://issues.apache.org/jira/browse/MAPREDUCE-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786353#comment-13786353 ]
Zhijie Shen commented on MAPREDUCE-5562:
----------------------------------------

bq. How does this change interact with an RM restart scenario – will it cause every AM trying to unregister to crash?

AFAIK, if the RM restarts while an AM is unregistering, the unregister call will throw an exception, because the RM cannot find lastResponse after it restarts. [~jianhe], would you please confirm? The AM will then exit here. However, the exception can also be caused by other problems, such as another RM internal error or a network failure.

bq. If the error is a read timeout or connection refused then I'm not sure we want the AM to fall over immediately in those cases, especially when work-preserving restart is added to the RM. We certainly don't want clients to do so in the same scenarios. If the error is a bad token or something else that is not going to succeed on a retry then yeah, we should shut down the AM.

I agree, and it's not a perfect solution. Ideally, we should distinguish the different types of exceptions and handle them separately. For now, I'm being this harsh on unregister exceptions to avoid, as far as possible, the race conditions that we found recently or that are still unseen, and so to unblock release 2.2.0. This may be just a short-term solution; we can come back to it later and refine the fix. Thoughts?

bq. What if this is the last AM attempt? Do we really want to orphan the staging directory and fail to generate job history in those cases?

Yes, it's not good, but even if unregister() succeeds, the staging directory is still likely not to be cleaned up because of some other failure in between. Job history is now moved to the JHS before unregister(), but it is still at risk if the AM crashes.

bq. If we end up deciding System.exit is really the proper thing to do here then it should be using ExitUtil rather than calling System.exit directly.

+1, that seems to be the right thing.

> MR AM should exit when unregister() throws exception
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-5562
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5562
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: MAPREDUCE-5562.1.patch, MAPREDUCE-5562.2.patch, MAPREDUCE-5562.3.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.1#6144)
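
For reference, the longer-term idea from this comment (tell the exception types apart, retry the transient ones, and exit only on the rest) might look roughly like the sketch below. Everything in it is hypothetical: `UnregisterPolicy`, `Unregistrar`, and the choice of which exceptions count as transient are assumptions made for illustration, not code from any of the attached patches or from Hadoop. It uses only JDK classes so it can run on its own. A real AM would also wait between attempts and, on a `false` result, terminate through Hadoop's `ExitUtil` rather than `System.exit`.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/** Hypothetical helper; not part of Hadoop or of the MAPREDUCE-5562 patches. */
final class UnregisterPolicy {

    enum Action { RETRY, EXIT }

    /** Hypothetical stand-in for the AM's unregister call to the RM. */
    interface Unregistrar {
        void unregister() throws Exception;
    }

    /**
     * Assumption: read timeouts and refused connections are transient,
     * everything else (bad token, RM internal error, ...) is fatal.
     * The cause chain is walked because RPC layers often wrap the
     * underlying network exception.
     */
    static Action classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SocketTimeoutException || t instanceof ConnectException) {
                return Action.RETRY;
            }
        }
        return Action.EXIT;
    }

    /**
     * Returns true if unregister succeeded within maxAttempts.
     * Returns false on a fatal error or when attempts run out, in which
     * case the caller is expected to shut the AM down.
     * Backoff between attempts is omitted to keep the sketch short.
     */
    static boolean unregisterWithRetry(Unregistrar call, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.unregister();
                return true;
            } catch (Exception e) {
                if (classify(e) == Action.EXIT) {
                    return false;
                }
                // Transient error: fall through and try again.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(classify(new ConnectException("connection refused"))); // RETRY
        System.out.println(classify(
            new IOException("rpc failed", new SocketTimeoutException("read timed out")))); // RETRY
        System.out.println(classify(new SecurityException("bad token"))); // EXIT
    }
}
```

Whether an exception after an RM restart (the missing lastResponse case) should count as transient is exactly the open question in this thread; the sketch treats it as fatal only because it is not one of the two network exceptions listed.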