[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250355#comment-13250355
 ] 

Bikas Saha commented on MAPREDUCE-4128:
---------------------------------------

The current recovery mechanism seems to be designed to recover only completed 
tasks, so it assumes that all attempts of such tasks are also complete. It 
loads the completed tasks and replays them until the replay has completed all 
attempts of all completed tasks.
This breaks whenever an attempt is running after a previously successful 
completion, because the replay does not have the information needed to handle 
running attempts correctly.
Scenario 1: MAPREDUCE-3921 introduces such a case because it re-runs 
successful map tasks if the successful attempts ran on a bad machine.
Scenario 2: Even in the current code, when a successful map is re-run because 
of too many fetch failures, the same situation arises and causes a failure 
in recovery.
The solution proposed in the patch is to make sure that a task that is re-run 
is not marked as completed during recovery. The JobHistoryParser has been 
changed to remove the "SUCCEEDED" status from a task if the successful 
attempt of that task later reports a failure. This fixes the repro case 
mentioned above. I have improved that test case to cover Scenario 1. Scenario 2 
will be covered in MAPREDUCE-3921.
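
To illustrate the idea, here is a minimal, self-contained sketch of the parsing 
rule. It is not the code from the patch and does not use the real 
JobHistoryParser API: RecoverySketch, HistoryEvent, TaskInfo, parse and 
recoverableTasks are hypothetical names, and the event model is reduced to 
"attempt succeeded" and "attempt failed".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; these types are NOT Hadoop's JobHistoryParser API.
public class RecoverySketch {
    enum EventType { ATTEMPT_SUCCEEDED, ATTEMPT_FAILED }

    // One simplified history event: which attempt of which task, and what happened.
    static final class HistoryEvent {
        final String taskId;
        final String attemptId;
        final EventType type;

        HistoryEvent(String taskId, String attemptId, EventType type) {
            this.taskId = taskId;
            this.attemptId = attemptId;
            this.type = type;
        }
    }

    // Per-task state accumulated while reading the history.
    static final class TaskInfo {
        String status;              // "SUCCEEDED" or null (not complete)
        String successfulAttemptId; // attempt that made the task succeed, if any
    }

    // Replays the events in order and builds the per-task view used for recovery.
    static Map<String, TaskInfo> parse(List<HistoryEvent> events) {
        Map<String, TaskInfo> tasks = new LinkedHashMap<>();
        for (HistoryEvent e : events) {
            TaskInfo task = tasks.computeIfAbsent(e.taskId, k -> new TaskInfo());
            if (e.type == EventType.ATTEMPT_SUCCEEDED) {
                task.status = "SUCCEEDED";
                task.successfulAttemptId = e.attemptId;
            } else if (e.attemptId.equals(task.successfulAttemptId)) {
                // The attempt that completed the task later reported a failure
                // (e.g. too many fetch failures), so the task is being re-run:
                // drop the SUCCEEDED status so recovery does not treat it as done.
                task.status = null;
                task.successfulAttemptId = null;
            }
        }
        return tasks;
    }

    // Only tasks still marked SUCCEEDED are candidates for recovery.
    static List<String> recoverableTasks(Map<String, TaskInfo> tasks) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, TaskInfo> entry : tasks.entrySet()) {
            if ("SUCCEEDED".equals(entry.getValue().status)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<HistoryEvent> events = new ArrayList<>();
        events.add(new HistoryEvent("task_1", "attempt_1_0", EventType.ATTEMPT_SUCCEEDED));
        events.add(new HistoryEvent("task_2", "attempt_2_0", EventType.ATTEMPT_SUCCEEDED));
        // The successful attempt of task_1 fails afterwards, so task_1 is re-run.
        events.add(new HistoryEvent("task_1", "attempt_1_0", EventType.ATTEMPT_FAILED));
        System.out.println(recoverableTasks(parse(events))); // prints [task_2]
    }
}
```

In this sketch task_1 is excluded from recovery because its successful attempt 
failed later, so the new AM would schedule it again instead of trying to replay 
an attempt that never finished.
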
I expect the patch to introduce 3 additional warnings because of raw types in 
event handling (similar to existing warnings).
I also fixed the compilation of a broken rumen class and test, assuming the new 
field added to TaskFinishedEvent is not relevant to them.

                
> AM Recovery expects all attempts of a completed task to also be completed.
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4128
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4128
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.0.0
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>             Fix For: 3.0.0
>
>         Attachments: MAPREDUCE-4128.patch
>
>
> The AM seems to assume that all attempts of a completed task (from a previous 
> AM incarnation) are also completed. There is at least one case in which this 
> does not hold: cancellation of a completed task, which results in a new 
> running attempt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
