Reducers hang in SHUFFLING phase due to duplicate completed tasks in
TaskTracker.FetchStatus.allMapEvents
---------------------------------------------------------------------------------------------------------
Key: HADOOP-4360
URL: https://issues.apache.org/jira/browse/HADOOP-4360
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.17.2
Reporter: Zheng Shao
On our cluster we have seen the JobTracker enter a bad state in which many
TaskTrackers end up with duplicate entries in
TaskTracker.FetchStatus.allMapEvents.
Since the TaskTracker fetches newly completed map tasks using the size of
allMapEvents as the starting index, the duplicates prevent the TaskTracker from
ever getting all completed map tasks. As a result, the reducer hangs in the
shuffling phase.
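The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch and not Hadoop code: the class FetchIndexDemo, the methods getEvents and fetchAll, and the event names are all hypothetical stand-ins for the JobTracker's event list and the TaskTracker's fetch loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FetchIndexDemo {
    // Hypothetical stand-in for the JobTracker's ordered list of completed map events.
    static final List<String> SERVER_EVENTS =
            Arrays.asList("map_0", "map_1", "map_2", "map_3");

    // Returns at most max events starting at fromIndex, or an empty list
    // if fromIndex is at or past the end of the server-side list.
    static List<String> getEvents(int fromIndex, int max) {
        if (fromIndex >= SERVER_EVENTS.size()) {
            return Collections.emptyList();
        }
        int to = Math.min(SERVER_EVENTS.size(), fromIndex + max);
        return new ArrayList<>(SERVER_EVENTS.subList(fromIndex, to));
    }

    // Simulates the fetch loop: the local list size is used as the starting
    // index of every request, and every reply is appended unconditionally.
    static List<String> fetchAll(List<String> local) {
        while (true) {
            List<String> batch = getEvents(local.size(), 2);
            if (batch.isEmpty()) {
                break;
            }
            local.addAll(batch);
        }
        return local;
    }
}
```

If the local list starts out as [map_0, map_0] (one duplicate), the next request starts at index 2, so map_1 is skipped and never fetched. A reducer that needs the output of map_1 would then wait indefinitely.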
The problem is not fixed by killing and restarting the TaskTracker, and when
it happens, many TaskTrackers show the same problem at once.
The cause appears to lie in the communication between the JobTracker and the
TaskTrackers.
An easy preventive fix would be to include the starting index of the list of
completed map tasks in the JobTracker's response to the TaskTracker, so that
the TaskTracker can simply discard the data if the starting index does not
match the current size of the array.
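A minimal sketch of the proposed guard follows. It assumes the response carries the starting index alongside the batch of events. The class GuardedMapEvents and its methods are hypothetical names used for illustration, not the actual TaskTracker API.

```java
import java.util.ArrayList;
import java.util.List;

public class GuardedMapEvents {
    // Local copy of the completed map events, in server order.
    private final List<String> allMapEvents = new ArrayList<>();

    // Index that the next request should ask the server to start from.
    public synchronized int nextIndex() {
        return allMapEvents.size();
    }

    // Appends the batch only if the server's starting index equals the
    // current local size. Returns false and discards the batch otherwise.
    public synchronized boolean accept(int startIndex, List<String> batch) {
        if (startIndex != allMapEvents.size()) {
            return false;
        }
        allMapEvents.addAll(batch);
        return true;
    }

    // Returns a snapshot of the events accepted so far.
    public synchronized List<String> events() {
        return new ArrayList<>(allMapEvents);
    }
}
```

With this check, a reply that is delivered twice is appended once and rejected the second time, because by then the local size no longer matches the reply's starting index. The guard only keeps the local list consistent. It does not explain why the duplicates are sent in the first place.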