[ https://issues.apache.org/jira/browse/MAPREDUCE-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250355#comment-13250355 ]
Bikas Saha commented on MAPREDUCE-4128:
---------------------------------------
The current recovery mechanism seems to be designed to recover only completed
tasks, and it assumes that all attempts of such tasks are also complete. It
loads the completed tasks and replays their events until the replay has
completed all attempts of all completed tasks.
This breaks whenever an attempt is still running after the task previously
completed successfully, because the replay does not have the information it
needs to handle running attempts correctly.
Scenario 1 : MAPREDUCE-3921 introduces such an instance because it re-runs
successful map tasks if the successful attempts had run on a bad machine.
Scenario 2 : Even in the current code, when a successful map is rerun because
of too many fetch failures, the same situation arises and causes recovery to
fail.
The proposed solution in the patch is to make sure that if a task is re-run
then it is not marked as completed during recovery. The JobHistoryParser has
been changed to remove the "SUCCEEDED" status on a task if the successful
attempt of that task later reports a failure. This fixes the repro case
mentioned above. I have improved that testcase to cover Scenario 1. Scenario 2
will be covered in MAPREDUCE-3921.
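
To illustrate the idea (this is not the code in the patch), here is a minimal
sketch of the rule: a task keeps its "SUCCEEDED" status only as long as the
attempt that produced it has not reported a failure afterwards. The class and
method names are hypothetical and do not match the real JobHistoryParser API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the history replay.
// It is NOT the real JobHistoryParser; names are made up for illustration.
class RecoverySketch {
    // taskId -> id of the attempt that made the task SUCCEEDED.
    // A task with no entry here is not treated as completed.
    private final Map<String, String> successfulAttempt = new HashMap<>();

    // Replayed when an attempt finishes successfully: the task is marked SUCCEEDED.
    void onAttemptSucceeded(String taskId, String attemptId) {
        successfulAttempt.put(taskId, attemptId);
    }

    // Replayed when an attempt reports a failure. If it is the attempt that
    // had made the task SUCCEEDED, the SUCCEEDED status is removed so that
    // recovery does not treat the task as completed.
    void onAttemptFailed(String taskId, String attemptId) {
        if (attemptId.equals(successfulAttempt.get(taskId))) {
            successfulAttempt.remove(taskId);
        }
    }

    // Only tasks still marked SUCCEEDED after the replay are recovered;
    // all others have to be run again by the new AM.
    boolean isRecoverable(String taskId) {
        return successfulAttempt.containsKey(taskId);
    }
}
```

With this rule, a map that succeeded and was later failed (for example after
too many fetch failures) is simply not recovered as completed, so the replay
never has to deal with its new running attempt.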
I am expecting the patch to introduce 3 additional warnings because of raw
types in event handling (similar to existing warnings).
I fixed the compilation of a rumen class and test that were broken, assuming
the new field added to TaskFinishedEvent is not relevant to them.
> AM Recovery expects all attempts of a completed task to also be completed.
> --------------------------------------------------------------------------
>
> Key: MAPREDUCE-4128
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4128
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 3.0.0
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4128.patch
>
>
> The AM seems to assume that all attempts of a completed task (from a previous
> AM incarnation) would also be completed. There is at least one case in which
> this does not hold: cancellation of a completed task results in a new
> running attempt.