[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506594#comment-13506594 ]
Bikas Saha commented on MAPREDUCE-4819:
---------------------------------------
I am not quite clear on this: why would the commit be repeated if the job does
not execute any task at all?
As far as I understand from the comments (I haven't looked at the code), the
commit code seems to be user-pluggable. In that case, how can we ensure that
every commit implementation can be made into a singleton operation? Can it be
as simple as a committer refusing to commit if the output file already exists?
Are committers allowed to delete an output file if it exists? If so, how does
a committer differentiate between a checkpointed commit from a previous
crashed run and an old commit from a successful job?
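To make the question concrete, here is a rough sketch of one way a committer might tell these cases apart. It is not taken from the Hadoop code or from the attached patches. It assumes the committer writes two marker files tagged with the job ID; the marker names, `inspect_commit`, and `commit_once` are all hypothetical, and a real filesystem would also need the marker writes to be atomic.

```python
import os
from enum import Enum

# Hypothetical marker file names; not part of any Hadoop API.
STARTED_MARKER = "_COMMIT_STARTED"
DONE_MARKER = "_COMMIT_DONE"


class CommitState(Enum):
    NOT_STARTED = "no output yet"
    INTERRUPTED = "an earlier attempt of this job crashed mid-commit"
    DONE = "an earlier attempt of this job finished the commit"
    FOREIGN = "output belongs to some other job, or has no markers"


def _read_marker(output_dir, name):
    """Return the job ID stored in a marker file, or None if it is absent."""
    path = os.path.join(output_dir, name)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()


def _write_marker(output_dir, name, job_id):
    with open(os.path.join(output_dir, name), "w") as f:
        f.write(job_id)


def inspect_commit(output_dir, job_id):
    """Classify existing output by comparing marker contents to job_id."""
    if not os.path.isdir(output_dir) or not os.listdir(output_dir):
        return CommitState.NOT_STARTED
    if _read_marker(output_dir, DONE_MARKER) == job_id:
        return CommitState.DONE
    if _read_marker(output_dir, STARTED_MARKER) == job_id:
        return CommitState.INTERRUPTED
    return CommitState.FOREIGN


def commit_once(output_dir, job_id, do_commit):
    """Run do_commit at most once per job.

    Returns True if this call performed the commit and False if an earlier
    attempt of the same job already completed it. Raises RuntimeError when
    the output cannot be safely committed over.
    """
    state = inspect_commit(output_dir, job_id)
    if state is CommitState.DONE:
        return False
    if state is not CommitState.NOT_STARTED:
        raise RuntimeError("refusing to commit: " + state.value)
    os.makedirs(output_dir, exist_ok=True)
    _write_marker(output_dir, STARTED_MARKER, job_id)
    do_commit(output_dir)  # user-pluggable commit logic
    _write_marker(output_dir, DONE_MARKER, job_id)
    return True
```

In this sketch a re-attempt never deletes existing output. It skips the commit when the same job already finished it, and it fails loudly when the commit was interrupted or when the output belongs to a different job, leaving the recovery decision to the caller.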
On a side note, we should encourage projects that depend on output markers for
job completion polling to stop doing that and start using APIs, perhaps at the
next version change. Continuing to support these kinds of use cases could make
solutions more complex and fragile than they need to be.
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Bikas Saha
> Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch
>
>
> If the AM reports final job status to the client but then crashes before
> unregistering with the RM then the RM can run another AM attempt. Currently
> AM re-attempts assume that the previous attempts did not reach a final job
> state, and that causes the job to rerun (from scratch, if the output format
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the
> job is bad for a number of reasons. If the job failed, it's confusing at
> best since the client was already told the job failed but the subsequent
> attempt could succeed. If the job succeeded there could be data loss, as a
> subsequent job launched by the client tries to consume the job's output as
> input just as the re-attempt starts removing output files in preparation for
> the output commit.