[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056026#comment-14056026
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---------------------------------------

Thanks to [~vinodkv] for the thoughts. I had an offline discussion with Vinod; 
here is a summary.

Basically, there are 3 cases that need cleanup:
a. Job completed (failed or succeeded, regardless of whether it is the last 
retry)
b. A failure happened and was captured by MRAppMasterShutdownHook
c. A failure happened and was not captured by MRAppMasterShutdownHook

The options Vinod proposed were:
{code}
1. YARN informs AM that it is the last retry as part of AM start-up or the 
register API
2. YARN informs the AM that this is the last retry as part of AM unregister
3. YARN has a way to run a separate cleanup container after it knows for sure 
that the application finished exhausting all its attempts
{code}

(1) can solve a. and part of b.
Why only part of b? Because MRAppMasterShutdownHook may be triggered and then 
another failure may prevent the cleanup from completing.
(2) can only solve a.
The reason is that if isLastRetry (or mayBeTheLastAttempt) is not properly set 
at register, we don't know whether to do the cleanup or not.
(3) can solve a., b., and c.
Refer to YARN-2261 for more details.
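
The mapping between options and cases can be written down as a small, 
self-contained sketch. All names here (CleanupCoverage, CleanupCase, Option) 
are illustrative only and are not part of YARN or MapReduce; the sketch just 
encodes the coverage described above.

```java
import java.util.EnumSet;
import java.util.Set;

public class CleanupCoverage {
    // The three cleanup cases a., b., c. (illustrative names).
    enum CleanupCase { JOB_COMPLETED, FAILURE_CAUGHT_BY_HOOK, FAILURE_MISSED_BY_HOOK }

    // The three options (1), (2), (3), each with the cases it covers fully or partially.
    enum Option {
        LAST_RETRY_AT_REGISTER(
            EnumSet.of(CleanupCase.JOB_COMPLETED),
            EnumSet.of(CleanupCase.FAILURE_CAUGHT_BY_HOOK)),
        LAST_RETRY_AT_UNREGISTER(
            EnumSet.of(CleanupCase.JOB_COMPLETED),
            EnumSet.noneOf(CleanupCase.class)),
        SEPARATE_CLEANUP_CONTAINER(
            EnumSet.allOf(CleanupCase.class),
            EnumSet.noneOf(CleanupCase.class));

        final Set<CleanupCase> full;
        final Set<CleanupCase> partial;

        Option(Set<CleanupCase> full, Set<CleanupCase> partial) {
            this.full = full;
            this.partial = partial;
        }

        // True only if every cleanup case is fully covered by this option.
        boolean coversAll() {
            return full.containsAll(EnumSet.allOf(CleanupCase.class));
        }
    }

    public static void main(String[] args) {
        for (Option o : Option.values()) {
            System.out.println(o + ": full=" + o.full
                + ", partial=" + o.partial
                + ", coversAll=" + o.coversAll());
        }
    }
}
```

Only the separate cleanup container reports coversAll as true in this sketch.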

I tried to work on (1) first. However, I found that moving the isLastRetry 
setup from MRAppMaster.init to RMCommunicator causes a lot of code changes, 
many unit test failures, etc. 
So my suggestion is to finish (2) quickly, which makes the job-completed case 
correct (the most common case), and then push (3) forward.

I'll upload a patch implementing (2) for review soon.

Thanks,
Wangda

> MapReduce AM should not use maxAttempts to determine if this is the last retry
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5956
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: applicationmaster, mrv2
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Wangda Tan
>            Priority: Blocker
>
> Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
> don't count AM preemption towards AM failures on RM side, but MapReduce AM 
> itself checks the attempt id against the max-attempt count to determine if 
> this is the last attempt.
> {code}
>     public void computeIsLastAMRetry() {
>       isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
>     }
> {code}
> This causes issues w.r.t. deletion of the staging directory, etc.
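
The mismatch described in the quoted issue can be illustrated with a minimal, 
self-contained sketch. The class and method names are hypothetical, and the 
RM-side rule is a simplified assumption (only attempts counted as failures use 
up the max-attempt budget); it is not the actual ResourceManager code.

```java
public class LastRetrySketch {
    // Mirrors the AM-side check quoted above: raw attempt id vs. the limit.
    static boolean isLastByAttemptId(int attemptId, int maxAppAttempts) {
        return attemptId >= maxAppAttempts;
    }

    // Assumed, simplified RM-side view: the current attempt is the last one
    // only if its failure would bring the counted failures up to the limit.
    static boolean isLastByCountedFailures(int countedFailures, int maxAppAttempts) {
        return countedFailures + 1 >= maxAppAttempts;
    }

    public static void main(String[] args) {
        int maxAppAttempts = 2;
        // Attempt 1 was preempted, so it is not counted as a failure.
        int attemptId = 2;
        int countedFailures = 0;

        System.out.println("AM-side check says last retry: "
            + isLastByAttemptId(attemptId, maxAppAttempts));
        System.out.println("RM-side view says last retry: "
            + isLastByCountedFailures(countedFailures, maxAppAttempts));
    }
}
```

Under these assumptions the AM treats attempt 2 as the last retry, while the RM 
would still allow another attempt, so the staging directory could be deleted 
before a later attempt needs it.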



--
This message was sent by Atlassian JIRA
(v6.2#6252)
