[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274804#comment-13274804
 ] 

Bikas Saha commented on MAPREDUCE-4252:
---------------------------------------

I hesitate to introduce more events unless absolutely necessary. 
FAILED seems to be a good enough event to inform the task that an attempt has 
failed. Adding FAIL_FETCH_FAILURE now means that for any other kind of failure 
in the future I would have to introduce yet more events and more arcs in the 
state machine. Large state machines are hard to understand. 

While the change itself seems correct, I am not able to convince myself that it 
is the best fix.

What surprises me is that the success of an attempt did not end up terminating 
the concurrent attempts. I would expect the speculative attempt to be Killed 
during the call to AttemptTransitionSucceeded(). Did that not work?

Or was there a tight race in which the speculative attempt failed at the 
same time as the successful attempt completed? That is, did the KILLED event 
triggered by the SUCCEEDED attempt race against the FAILED event in the 
TaskAttemptImpl state machine, with the FAILED event winning?

What do you think about the following approach? Allow 
MapRetroActiveFailureTransition to return SUCCEEDED as a possible state. In 
that transition, if the failed attempt is not the same as the attempt that 
SUCCEEDED, then that failure would not change the state of the Task. The Task 
would remain in the SUCCEEDED state. We can rename 
MapRetroActiveFailureTransition to something more appropriate.
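
To make the proposal concrete, here is a minimal, self-contained sketch of the 
idea. It does not use the real Hadoop state machine classes. The class 
RetroactiveFailureSketch, its Task type, the method names, and the choice of 
SCHEDULED as the state a task returns to after its successful attempt fails are 
all assumptions made for illustration.

```java
import java.util.Objects;

// Hypothetical, simplified model of a task; not the actual TaskImpl code.
public class RetroactiveFailureSketch {

    public enum TaskState { RUNNING, SCHEDULED, SUCCEEDED }

    public static final class Task {
        private TaskState state = TaskState.RUNNING;
        // Id of the attempt whose success moved the task to SUCCEEDED.
        private String successfulAttempt;

        public TaskState getState() {
            return state;
        }

        public void onAttemptSucceeded(String attemptId) {
            successfulAttempt = attemptId;
            state = TaskState.SUCCEEDED;
        }

        // Multi-arc transition: a FAILED event can either leave the task
        // in SUCCEEDED or send it back to be rescheduled.
        public TaskState onAttemptFailed(String attemptId) {
            boolean otherAttemptFailed =
                state == TaskState.SUCCEEDED
                    && !Objects.equals(attemptId, successfulAttempt);
            if (otherAttemptFailed) {
                // A different (e.g. speculative) attempt failed after the
                // task already succeeded: the task state does not change.
                return state;
            }
            // The successful attempt itself failed retroactively (or the
            // task had not succeeded yet): assume it must be rescheduled.
            successfulAttempt = null;
            state = TaskState.SCHEDULED;
            return state;
        }
    }
}
```

With this shape, a late FAILED event from a speculative attempt is absorbed by 
the existing transition, and no new event type is needed.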

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 
> 999 of the tasks had completed, the job is now sitting forever with 1 task 
> "pending".

