[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504784#comment-13504784 ]

Robert Joseph Evans commented on MAPREDUCE-4819:
------------------------------------------------

We are informing several different actors of "success/failure" in many 
different ways.

# _SUCCESS file being written to HDFS by the output committer as part of 
commitJob()
# job end notification by hitting an http server
# client being informed through RPC
# history server being informed by placing the log in a directory it can see
# resource manager being informed that the application is done

Some of these are much more important to report than others, but either way we 
still have at a minimum two different things that need to be tied together: 
commitJob() and informing the RM not to run us again.  Rearranging their order 
will not fix the fact that after commitJob() finishes there is still the 
possibility that something will fail, but that failure must not fail the job.  
We really need a two-phase commit recorded in the job history file:

# "I am about to commit the job output."
# commitJob()
# "I finished committing the job output successfully."
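A re-attempt's recovery decision based on those two records could be sketched 
roughly like this (a minimal sketch; the marker, enum, and class names below 
are invented for illustration, not the actual MR AM API):

```java
import java.util.Set;

public class CommitRecovery {
    // Markers a previous AM attempt may have written to the job history
    // file (hypothetical names).
    public enum Marker { COMMIT_STARTED, COMMIT_FINISHED }

    // What a re-attempt should do given the markers it recovered.
    public enum Action { COMMIT, SKIP_COMMIT, FAIL_FAST }

    public static Action decide(Set<Marker> recovered) {
        if (recovered.contains(Marker.COMMIT_FINISHED)) {
            // A previous attempt committed successfully; calling
            // commitJob() again could change the output directory.
            return Action.SKIP_COMMIT;
        }
        if (recovered.contains(Marker.COMMIT_STARTED)) {
            // Commit may have been left half done; re-running it is
            // unsafe, so fail fast instead of risking the output.
            return Action.FAIL_FAST;
        }
        // No commit was ever attempted; safe to commit now.
        return Action.COMMIT;
    }
}
```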

Without this there will always be the possibility that commitJob() is called 
twice, which would result in changes to the output directory.  I would also 
argue that some of these notifications are important enough that we should 
consider reporting them twice and updating the listener to handle double 
reporting, like informing the history server that the job finished.  For 
others, like job end notification or client RPC, it is not as critical.
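Making a listener tolerate double reporting mostly means deduplicating on job 
id; a minimal sketch (class and method names invented):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupingJobFinishListener {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a given job's completion is
    // reported; repeated reports for the same job id are ignored.
    public synchronized boolean onJobFinished(String jobId) {
        return seen.add(jobId);
    }
}
```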

Koji,

I get that we want to reduce the risk of users shooting themselves in the 
foot, but the file must be stored in a user-accessible location because the 
entire job runs as the user.  It is stored under the .staging directory, 
which, if the user deletes it, will already cause many other problems and 
will probably fail the job.  We can try to set it up so that if the previous 
job history file does not exist on any app attempt but the first one, we fail 
fast.  That would prevent retries from messing up the output directory.
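That fail-fast check reduces to a simple condition on the attempt number and 
the presence of the prior history file (a sketch with invented names, not the 
actual AM code):

```java
public class HistoryCheck {
    // Fail fast when a re-attempt (attempt id > 1) finds no prior history
    // file: the user may have deleted the .staging directory, so the
    // recovery state cannot be trusted and the output must be protected.
    public static boolean shouldFailFast(int attemptId, boolean priorHistoryExists) {
        return attemptId > 1 && !priorHistoryExists;
    }
}
```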
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
