[ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom White updated MAPREDUCE-4252:
---------------------------------
Attachment: MapReduce.png
MAPREDUCE-4252.patch
The problem is that, when a speculative task attempt is running, a task that
has already SUCCEEDED can transition back to SCHEDULED (or FAILED).
1. A task attempt is started.
2. A speculative task attempt for the same task is started.
3. The speculative task attempt completes and causes the task to transition to
SUCCEEDED.
4. The initial task attempt fails and causes the task to transition to
SCHEDULED.
No further task attempts are scheduled, so the job never completes. I have
written a unit test that demonstrates this; it currently fails.
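The four steps above can be replayed against a toy state machine. This is only
a sketch, not the Hadoop TaskImpl code: the class, the reduced set of states,
and the T_ATTEMPT_LAUNCHED / T_ATTEMPT_SUCCEEDED names are simplified stand-ins
chosen for illustration. Only T_ATTEMPT_FAILED and the SCHEDULED / SUCCEEDED
states are taken from the description above.

```java
// Toy model for illustration only; not the real Hadoop classes.
class SpeculationBugSketch {
    enum TaskState { SCHEDULED, RUNNING, SUCCEEDED }

    // Simplified stand-ins for task events (assumed names, except T_ATTEMPT_FAILED).
    enum TaskEvent { T_ATTEMPT_LAUNCHED, T_ATTEMPT_SUCCEEDED, T_ATTEMPT_FAILED }

    // Returns the state the task moves to when it receives the given event.
    static TaskState next(TaskState state, TaskEvent event) {
        switch (event) {
            case T_ATTEMPT_LAUNCHED:
                // A second (speculative) launch leaves a RUNNING task RUNNING.
                return state == TaskState.SCHEDULED ? TaskState.RUNNING : state;
            case T_ATTEMPT_SUCCEEDED:
                return TaskState.SUCCEEDED;
            case T_ATTEMPT_FAILED:
                // The problematic arc: it is taken even when the task has
                // already SUCCEEDED through another attempt.
                return TaskState.SCHEDULED;
            default:
                throw new IllegalArgumentException("unexpected event: " + event);
        }
    }

    // Applies the events in order, starting from SCHEDULED.
    static TaskState replay(TaskEvent... events) {
        TaskState state = TaskState.SCHEDULED;
        for (TaskEvent event : events) {
            state = next(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        TaskState finalState = replay(
            TaskEvent.T_ATTEMPT_LAUNCHED,   // 1. initial attempt starts
            TaskEvent.T_ATTEMPT_LAUNCHED,   // 2. speculative attempt starts
            TaskEvent.T_ATTEMPT_SUCCEEDED,  // 3. speculative attempt completes
            TaskEvent.T_ATTEMPT_FAILED);    // 4. initial attempt fails
        System.out.println(finalState);     // prints SCHEDULED
    }
}
```

In this model the task ends in SCHEDULED with no attempt left to run it, which
matches the hang described above.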
The situation occurs because TaskEventType.T_ATTEMPT_FAILED can cause a task to
transition out of the SUCCEEDED state (see attached state diagram). This event
type is raised by one of:
* FailedTransition - when a task attempt fails
* DeallocateContainerTransition - when a task attempt goes from
assigned/unassigned to killed/failed
* TooManyFetchFailureTransition - when a reducer fails to get the map output
Only the last of these seems like a good reason to move a task out of the
SUCCEEDED state. The first two should leave the task in SUCCEEDED, and the
third should be handled with a new event type
(T_ATTEMPT_FAILED_RETROSPECTIVELY or T_ATTEMPT_TOO_MANY_FETCH_FAILURE) that
transitions the task to SCHEDULED (or FAILED).
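As a rough sketch of that proposal (not the attached patch, and not the real
Hadoop state machine), the handling of a SUCCEEDED task could look like the
following. The class, the fromSucceeded method, and the attemptsExhausted flag
are hypothetical names introduced here; the flag is only an assumed way of
choosing between SCHEDULED and FAILED.

```java
// Toy model of the proposed event split; illustrative only.
class RetrospectiveFailureSketch {
    enum TaskState { SCHEDULED, SUCCEEDED, FAILED }

    enum TaskEvent {
        T_ATTEMPT_FAILED,                  // ordinary attempt failure
        T_ATTEMPT_TOO_MANY_FETCH_FAILURE   // proposed new type for fetch failures
    }

    // Next state for a task that is currently SUCCEEDED.
    // attemptsExhausted is a hypothetical flag: true when no retries remain.
    static TaskState fromSucceeded(TaskEvent event, boolean attemptsExhausted) {
        switch (event) {
            case T_ATTEMPT_FAILED:
                // A late failure of another attempt no longer undoes success.
                return TaskState.SUCCEEDED;
            case T_ATTEMPT_TOO_MANY_FETCH_FAILURE:
                // The map output is unusable, so the task must run again
                // or be marked as failed.
                return attemptsExhausted ? TaskState.FAILED : TaskState.SCHEDULED;
            default:
                throw new IllegalArgumentException("unexpected event: " + event);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromSucceeded(TaskEvent.T_ATTEMPT_FAILED, false));
        System.out.println(fromSucceeded(TaskEvent.T_ATTEMPT_TOO_MANY_FETCH_FAILURE, false));
        System.out.println(fromSucceeded(TaskEvent.T_ATTEMPT_TOO_MANY_FETCH_FAILURE, true));
    }
}
```

With this split, a late failure of the original attempt leaves the task in
SUCCEEDED, and only the fetch-failure event can reopen it.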
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
> Key: MAPREDUCE-4252
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.1
> Reporter: Tom White
> Assignee: Tom White
> Attachments: MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after
> 999 of the tasks had completed, the job is now sitting forever with 1 task
> "pending".