[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

Jason Lowe (JIRA) Wed, 28 Nov 2012 13:50:00 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505935#comment-13505935
 ]


Jason Lowe commented on MAPREDUCE-4819:
---------------------------------------

Took a look at the patch, and I think we are missing some critical corner 
cases.  For example, if we finish committing the job and the committer is using 
a marker of sorts (e.g., _SUCCESS), then we could trigger downstream jobs to 
run *before* the job history is completely closed.  I believe Oozie polls for 
the _SUCCESS marker, for example.  If we crash after committing but before 
writing the job finished record, then the re-attempt could commit again while 
another job is attempting to consume our output, leading to potential data loss 
even though both jobs would have "SUCCEEDED".  That's a Bad Thing.

I think the crux of the issue is that we must not commit twice.  The act of 
committing is what could trigger downstream jobs, and it may not itself be 
repeatable or recoverable, so we should treat AM crashes during job commit much 
like we treat non-crashing failures during job commit today, i.e., the job 
should fail without re-running and re-committing.  In the worst case we have a 
false negative where the output did commit successfully but we thought the job 
failed, and I agree with Koji that a false negative beats a false positive in 
this case.

This means we need markers noting when we start and when we finish committing, 
each sync'd to the job history file.  If the AM relaunches and finds we crashed 
during commit (a start marker with no finish marker), we should treat it as we 
do a committer failure and fail the job.  If the re-attempt finds we finished 
committing, then we simply need to unregister from the RM without re-running.
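
To make the proposal concrete, here is a rough sketch of the decision a 
relaunched AM might make from the markers it finds.  Everything in it is 
hypothetical: the class name, the marker strings, and the Action values are 
made up for illustration and are not taken from the attached patch or from the 
existing job history code.  The no-marker case falls through to today's 
behavior of running the job.

```java
import java.util.List;

/**
 * Illustration only. None of these names come from the MAPREDUCE-4819 patch
 * or from the real job history event types; they are assumptions.
 */
final class CommitRecovery {
    /** What a relaunched AM attempt should do. */
    enum Action { RUN_JOB, FAIL_JOB, UNREGISTER_ONLY }

    // Hypothetical marker names, assumed to be sync'd to the history file.
    static final String COMMIT_STARTED = "COMMIT_STARTED";
    static final String COMMIT_FINISHED = "COMMIT_FINISHED";

    /** Decide from the markers recovered from the previous attempt. */
    static Action decide(List<String> syncedMarkers) {
        boolean started = syncedMarkers.contains(COMMIT_STARTED);
        boolean finished = syncedMarkers.contains(COMMIT_FINISHED);
        if (finished) {
            // Commit completed: never commit twice, just unregister from the RM.
            return Action.UNREGISTER_ONLY;
        }
        if (started) {
            // Crashed mid-commit: treat it like a committer failure.
            return Action.FAIL_JOB;
        }
        // No commit was attempted: run (or recover) the job as is done today.
        return Action.RUN_JOB;
    }
}
```

The point of checking for the finish marker first is that a completed commit 
always wins, and a start marker on its own is enough to fail the job, which 
gives the false negative rather than the false positive.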
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
