[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786353#comment-13786353
 ] 

Zhijie Shen commented on MAPREDUCE-5562:
----------------------------------------

bq. How does this change interact with an RM restart scenario – will it cause 
every AM trying to unregister to crash?

AFAIK, if the RM restarts while the AM is unregistering, the unregister call 
will throw an exception because the RM cannot find lastResponse after the 
restart. [~jianhe], would you please confirm? In that case, the AM will exit 
here. However, the exception can also have other causes, such as another RM 
internal error or a network failure.

bq. If the error is a read timeout or connection refused then I'm not sure we 
want the AM to fall over immediately in those cases, especially when 
work-preserving restart is added to the RM. We certainly don't want clients to 
do so in the same scenarios. If the error is a bad token or something else that 
is not going to succeed on a retry then yeah, we should shut down the AM.

I agree, and it's not a perfect solution. Ideally, we should distinguish the 
different types of exceptions and handle them separately. For now, the reason 
I'm being so harsh on unregister exceptions is to avoid, as much as possible, 
the race conditions that we found recently or have not yet seen, and so 
unblock release 2.2.0. It may be just a short-term solution; we can come back 
to it later to refine the fix. Thoughts?
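
To make the "handle them separately" idea concrete, here is a minimal sketch of 
what such a classification could look like. It is not taken from any of the 
attached patches: the class name UnregisterFailurePolicy, the Action enum, and 
the choice of which exceptions count as transient are all assumptions for 
illustration, and it uses only JDK exception types rather than the real YARN 
ones.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/**
 * Hypothetical helper (not part of Hadoop): decides whether a failure seen
 * during unregister() looks transient or is unlikely to succeed on retry.
 */
final class UnregisterFailurePolicy {

    /** What the caller should do with the failure. */
    enum Action { RETRY, SHUT_DOWN }

    // Bound on how far the cause chain is followed, to guard against cycles.
    private static final int MAX_CAUSE_DEPTH = 16;

    private UnregisterFailurePolicy() {
    }

    static Action classify(Throwable error) {
        Throwable t = error;
        // RPC layers often wrap the original error, so inspect the causes too.
        for (int depth = 0; t != null && depth < MAX_CAUSE_DEPTH; depth++) {
            // Assumption: connection refused and read timeout are treated as
            // transient, e.g. the RM is restarting or the network dropped.
            if (t instanceof ConnectException
                    || t instanceof SocketTimeoutException) {
                return Action.RETRY;
            }
            t = t.getCause();
        }
        // Anything else (for example a bad token) is assumed not to succeed
        // on a retry, so the caller should shut down.
        return Action.SHUT_DOWN;
    }
}
```

A caller would retry unregister() on RETRY, presumably with a bounded number of 
attempts, and exit on SHUT_DOWN. Which RM-side exceptions belong in which 
bucket is exactly the open question above.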

bq. What if this is the last AM attempt? Do we really want to orphan the 
staging directory and fail to generate job history in those cases?

Yes, it's not good, but even if unregister() succeeds, the staging directory 
may still not be cleaned up because of some other failure in between. Job 
history is now moved to JHS before unregister(), but it too is at risk if the 
AM crashes.

bq. If we end up deciding System.exit is really the proper thing to do here 
then it should be using ExitUtil rather than calling System.exit directly.

+1, that seems to be the right thing.

> MR AM should exit when unregister() throws exception
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-5562
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5562
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: MAPREDUCE-5562.1.patch, MAPREDUCE-5562.2.patch, 
> MAPREDUCE-5562.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)
