[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505900#comment-13505900
 ] 

Jason Lowe commented on MAPREDUCE-4819:
---------------------------------------

We can't have the AM looking for the file in done_intermediate.  The history 
server could have moved it out of there in the interim.  And I don't think we 
want the AM to "know" how to find its file in the final done location the 
history server puts it in either.  Too much coupling between those systems, 
IMHO.

I think leaving it in the staging directory is the correct solution.  As I 
mentioned, we need to make sure we don't delete the staging directory before 
unregistering with the RM.  Deleting it early prevents subsequent AM 
re-attempts right off the bat.  And deleting the staging directory before 
unregistering is happening today as discussed in YARN-244, so that problem is 
not specific to this fix.
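
To make the ordering concrete, here is a minimal sketch of the shutdown 
sequence described above.  The interfaces and method names are hypothetical 
stand-ins, not the actual MR AM or YARN APIs; the only point is that the 
unregister call comes before the staging directory is removed.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownOrderSketch {
    /** Hypothetical stand-in for the AM's connection to the RM. */
    public interface ResourceManagerClient {
        void unregister();
    }

    /** Hypothetical stand-in for the job's staging directory. */
    public interface StagingDir {
        void delete();
    }

    /**
     * Unregister first, then clean up.  If the AM crashes between the two
     * steps, the staging directory is merely left behind; the RM has already
     * been told the application is finished.
     */
    public static void shutdown(ResourceManagerClient rm, StagingDir staging) {
        rm.unregister();
        staging.delete();
    }

    public static void main(String[] args) {
        // Record the order of the two calls to show the sequence.
        List<String> calls = new ArrayList<>();
        shutdown(() -> calls.add("unregister"), () -> calls.add("delete-staging"));
        System.out.println(calls);
    }
}
```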

Leaving it in staging is straightforward.  No need for extra markers, racing 
with the history server, etc.  And if the staging directory is gone, well the 
AM can't relaunch in the first place, so no issues of re-running and 
re-committing there.  We could still have a discrepancy between the client 
thinking the job succeeded (which it basically did re: its output data) but the 
RM saying it failed, but this is fixable by moving the removal of the staging 
directory to after we unregister from the RM when we fix YARN-244.
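
As a rough illustration of the "leave it in staging" idea, the sketch below 
shows how an AM re-attempt might decide what to do at startup.  It assumes 
that the file left in staging marks a previous attempt that reached a final 
job state; the file name, the enum, and the use of the local filesystem 
(rather than HDFS) are all hypothetical and only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReattemptCheckSketch {
    /** Hypothetical marker file name; the real file naming differs. */
    public static final String HISTORY_FILE = "job.jhist";

    public enum Action { RERUN_JOB, REPORT_PREVIOUS_RESULT }

    /**
     * Assumption: if a previous attempt left its file in the staging
     * directory, it already reached a final state, so the job must not be
     * rerun or recommitted.
     */
    public static Action decide(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(HISTORY_FILE))) {
            return Action.REPORT_PREVIOUS_RESULT;
        }
        return Action.RERUN_JOB;
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        System.out.println(decide(staging));  // no file from a prior attempt yet

        Files.createFile(staging.resolve(HISTORY_FILE));
        System.out.println(decide(staging));  // a prior attempt left its file
    }
}
```

No separate marker is needed in this scheme, and nothing depends on where the 
history server eventually moves its own copy.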
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
