[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Attachment: MapReduce.png
                MAPREDUCE-4252.patch

The problem is that a speculative task attempt can cause a previously SUCCEEDED 
task to transition to SCHEDULED (or FAILED).

1. A task attempt is started.
2. A speculative task attempt for the same task is started.
3. The speculative task attempt completes and causes the task to transition to 
SUCCEEDED.
4. The initial task attempt fails and causes the task to transition to 
SCHEDULED.

No more task attempts are scheduled, so the job never completes. I have written 
a unit test that demonstrates this; it currently fails.
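
The four steps above can be replayed against a toy state machine. This is only 
a sketch, not Hadoop's actual task implementation: the class ToyTask, its 
State enum, and the event names T_ATTEMPT_LAUNCHED and T_ATTEMPT_SUCCEEDED are 
assumptions made for illustration, and the FAILED outcome is left out.

```java
/** Toy model of the reported behavior. Hypothetical; NOT Hadoop's task code. */
class ToyTask {
    enum State { SCHEDULED, RUNNING, SUCCEEDED }

    // T_ATTEMPT_FAILED is named in this report; the other two are illustrative.
    enum Event { T_ATTEMPT_LAUNCHED, T_ATTEMPT_SUCCEEDED, T_ATTEMPT_FAILED }

    private State state = State.SCHEDULED;

    State state() { return state; }

    void handle(Event event) {
        switch (event) {
            case T_ATTEMPT_LAUNCHED:
                if (state == State.SCHEDULED) state = State.RUNNING;
                break;
            case T_ATTEMPT_SUCCEEDED:
                state = State.SUCCEEDED;
                break;
            case T_ATTEMPT_FAILED:
                // The problematic arc: a failure is accepted in every state,
                // so a late failure reschedules an already SUCCEEDED task.
                state = State.SCHEDULED;
                break;
        }
    }

    public static void main(String[] args) {
        ToyTask task = new ToyTask();
        task.handle(Event.T_ATTEMPT_LAUNCHED);   // 1. initial attempt starts
        task.handle(Event.T_ATTEMPT_LAUNCHED);   // 2. speculative attempt starts
        task.handle(Event.T_ATTEMPT_SUCCEEDED);  // 3. speculative attempt completes
        task.handle(Event.T_ATTEMPT_FAILED);     // 4. initial attempt fails
        System.out.println(task.state());        // prints SCHEDULED
    }
}
```

In this model nothing launches a new attempt after step 4, so the task stays 
in SCHEDULED, which matches the "1 pending task" symptom.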

The situation occurs because TaskEventType.T_ATTEMPT_FAILED can cause a task to 
transition out of the SUCCEEDED state (see attached state diagram). This event 
is sent by one of:

* FailedTransition - when a task attempt fails 
* DeallocateContainerTransition - when a task attempt goes from 
assigned/unassigned to killed/failed
* TooManyFetchFailureTransition - when a reducer fails to get the map output

Only the last one seems like a good reason to move a task out of the 
SUCCEEDED state. The first two should leave the task in the SUCCEEDED 
state, and the third should be handled with a new event type 
(T_ATTEMPT_FAILED_RETROSPECTIVELY or T_ATTEMPT_TOO_MANY_FETCH_FAILURE) which 
transitions to SCHEDULED (or FAILED).
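
A rough sketch of the proposed handling, again on a toy state machine rather 
than the real Hadoop classes. FixedToyTask, its State enum, and the event names 
T_ATTEMPT_LAUNCHED and T_ATTEMPT_SUCCEEDED are hypothetical; 
T_ATTEMPT_TOO_MANY_FETCH_FAILURE is one of the two names suggested above. The 
FAILED outcome is omitted for brevity.

```java
/** Toy model of the proposed fix. Hypothetical; NOT Hadoop's task code. */
class FixedToyTask {
    enum State { SCHEDULED, RUNNING, SUCCEEDED }

    enum Event {
        T_ATTEMPT_LAUNCHED,               // illustrative name
        T_ATTEMPT_SUCCEEDED,              // illustrative name
        T_ATTEMPT_FAILED,                 // ordinary attempt failure
        T_ATTEMPT_TOO_MANY_FETCH_FAILURE  // proposed new event type
    }

    private State state = State.SCHEDULED;

    State state() { return state; }

    void handle(Event event) {
        switch (event) {
            case T_ATTEMPT_LAUNCHED:
                if (state == State.SCHEDULED) state = State.RUNNING;
                break;
            case T_ATTEMPT_SUCCEEDED:
                state = State.SUCCEEDED;
                break;
            case T_ATTEMPT_FAILED:
                // An ordinary failure no longer undoes a success.
                if (state != State.SUCCEEDED) state = State.SCHEDULED;
                break;
            case T_ATTEMPT_TOO_MANY_FETCH_FAILURE:
                // The map output cannot be fetched, so a SUCCEEDED task is rerun.
                if (state == State.SUCCEEDED) state = State.SCHEDULED;
                break;
        }
    }

    public static void main(String[] args) {
        FixedToyTask task = new FixedToyTask();
        task.handle(Event.T_ATTEMPT_LAUNCHED);
        task.handle(Event.T_ATTEMPT_LAUNCHED);
        task.handle(Event.T_ATTEMPT_SUCCEEDED);
        task.handle(Event.T_ATTEMPT_FAILED);
        System.out.println(task.state());  // prints SUCCEEDED
        task.handle(Event.T_ATTEMPT_TOO_MANY_FETCH_FAILURE);
        System.out.println(task.state());  // prints SCHEDULED
    }
}
```

Splitting the event this way keeps the decision in the task's transition 
table: only the fetch-failure event carries the meaning "the earlier success 
is no longer usable".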



                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> I ran a teragen with 1000 map tasks. Many task attempts failed, but after 
> 999 of the tasks had completed, the job is now sitting forever with 1 task 
> "pending".
