[
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504784#comment-13504784
]
Robert Joseph Evans commented on MAPREDUCE-4819:
------------------------------------------------
We are informing several different actors of "success/failure" in many
different ways.
# _SUCCESS file being written to HDFS by the output committer as part of
commitJob()
# job end notification by hitting an HTTP server
# client being informed through RPC
# history server being informed by placing the log in a directory it can see
# resource manager being informed that the application is done
Some of these are much more important to report than others, but either way we
still have, at a minimum, two different things that need to be tied together:
commitJob() and informing the RM not to run us again. Rearranging their order
will not fix the fact that after commitJob() finishes there is still the
possibility that something will fail which must not fail the job. We really
need to have a two-phase commit recorded in the job history file:
# I am about to commit the job output.
# commitJob()
# I finished committing the job output successfully.
Without this there will always be the possibility that commitJob() is called
twice, which would result in changes to the output directory. I would also
argue that some of these notifications are important enough that we should
consider reporting them twice and updating the listener to handle double
reporting, like informing the history server that the job finished. For
others, such as job end notification or client RPC, it is not so critical.
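A two-phase commit around commitJob() could be sketched roughly as follows. This is only an illustration, not the actual MR AM code: the class, method names, and in-memory CommitState field are hypothetical stand-ins for the marker events that would be appended to the job history file.

```java
// Sketch of a two-phase commit marker around commitJob().
// All names here are illustrative; the in-memory field stands in for
// COMMIT_STARTED / COMMIT_FINISHED events in the job history file.
enum CommitState { NONE, COMMIT_STARTED, COMMIT_FINISHED }

class TwoPhaseCommitSketch {
    private CommitState state = CommitState.NONE;

    // Stand-in for appending a marker event to the job history file.
    void writeHistoryEvent(CommitState s) {
        state = s;
    }

    CommitState readHistoryEvent() {
        return state;
    }

    // Record intent, commit, then record completion.
    void commitWithMarkers(Runnable commitJob) {
        writeHistoryEvent(CommitState.COMMIT_STARTED);
        commitJob.run();                       // e.g. OutputCommitter.commitJob()
        writeHistoryEvent(CommitState.COMMIT_FINISHED);
    }

    // On AM restart, decide whether committing again is safe.
    boolean safeToCommitAgain() {
        switch (readHistoryEvent()) {
            case NONE:            return true;   // commit never started
            case COMMIT_STARTED:  return false;  // ambiguous: must fail the job
            case COMMIT_FINISHED: return false;  // already committed: just finish
        }
        return false;
    }
}
```

The key point is the middle state: if a restarted AM sees COMMIT_STARTED without COMMIT_FINISHED, it cannot know how far the commit got, so it must not re-run commitJob().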
Koji,
I get that we want to reduce the risk of a user shooting themselves in the
foot, but the file must be stored in a user-accessible location because the
entire job is run as the user. It is stored under the .staging directory,
which, if the user deletes it, will already cause many other problems and will
probably cause the job to fail. We can try to set it up so that if the
previous job history file does not exist on any app attempt after the first
one, we fail fast. That would prevent retries from messing up the output
directory.
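The fail-fast idea above could be sketched like this; the class, method, and parameter names are hypothetical, not the real AM logic.

```java
// Hypothetical fail-fast check on AM restart; names are illustrative.
class FailFastCheck {
    // Returns true when the AM attempt should abort immediately rather
    // than risk re-running commitJob() against the output directory.
    static boolean shouldFailFast(int appAttemptId, boolean historyFileExists) {
        // The first attempt has no prior history file to look for.
        if (appAttemptId <= 1) {
            return false;
        }
        // On a retry, a missing history file most likely means the user
        // removed the .staging directory; failing fast avoids a second,
        // potentially destructive commit of the job output.
        return !historyFileExists;
    }
}
```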
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Bikas Saha
> Priority: Critical
>
> If the AM reports final job status to the client but then crashes before
> unregistering with the RM then the RM can run another AM attempt. Currently
> AM re-attempts assume that the previous attempts did not reach a final job
> state, and that causes the job to rerun (from scratch, if the output format
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the
> job is bad for a number of reasons. If the job failed, it's confusing at
> best since the client was already told the job failed but the subsequent
> attempt could succeed. If the job succeeded there could be data loss, as a
> subsequent job launched by the client tries to consume the job's output as
> input just as the re-attempt starts removing output files in preparation for
> the output commit.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira