[
https://issues.apache.org/jira/browse/HADOOP-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637606#action_12637606
]
Zheng Shao commented on HADOOP-4360:
------------------------------------
The problem can easily happen if there are multiple requests to the JobTracker
at the same time:
Let's say the TaskTracker's allMapEvents contains 5 elements now, and 2 requests
to the JobTracker are started at the same time, so both requests carry a
starting index of 5.
The 2 requests will get identical replies from the JobTracker. Each reply is
then added to the completion events separately, which causes the duplicates.
A simple "synchronized(fromEventId)" wrapper around the
"jobClient.getTaskCompletionEvents" call will avoid the problem. If we don't
want to do that, we should at least check whether fromEventId has changed
during the RPC call.
(let me withdraw the idea of adding a starting index to the JobTracker reply -
that does not seem necessary)
TaskTracker.java (from 0.17.2):

854:
  private List<TaskCompletionEvent> queryJobTracker(IntWritable fromEventId,
                                                    String jobId,
                                                    InterTrackerProtocol jobClient)
    throws IOException {
    TaskCompletionEvent t[] = jobClient.getTaskCompletionEvents(
                                                jobId,
                                                fromEventId.get(),
                                                probe_sample_size);
    // we are interested in map task completion events only, so store
    // only those
    List<TaskCompletionEvent> recentMapEvents =
      new ArrayList<TaskCompletionEvent>();
    for (int i = 0; i < t.length; i++) {
      if (t[i].isMap) {
        recentMapEvents.add(t[i]);
      }
    }
    fromEventId.set(fromEventId.get() + t.length);
    return recentMapEvents;
  }

613:
  public void fetchMapCompletionEvents() throws IOException {
    List<TaskCompletionEvent> recentMapEvents =
      queryJobTracker(fromEventId, jobId, jobClient);
    synchronized (allMapEvents) {
      allMapEvents.addAll(recentMapEvents);
    }
  }
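To illustrate the proposed "synchronized(fromEventId)" fix, here is a minimal
self-contained sketch. The names and types are simplified stand-ins, not the
real Hadoop classes (the fake getTaskCompletionEvents simulates the JobTracker
RPC, and an AtomicInteger plays the role of the IntWritable fromEventId). The
point is only that holding one lock across both the query and the index update
prevents two concurrent fetches from starting at the same index:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchRaceSketch {
    // Stand-ins for the TaskTracker fields (names assumed for illustration):
    static final List<Integer> allMapEvents =
        Collections.synchronizedList(new ArrayList<>());
    static final AtomicInteger fromEventId = new AtomicInteger(0);

    // Fake "JobTracker" RPC: returns the event ids from fromId up to total.
    static int[] getTaskCompletionEvents(int fromId, int total) {
        int n = Math.max(0, total - fromId);
        int[] events = new int[n];
        for (int i = 0; i < n; i++) events[i] = fromId + i;
        return events;
    }

    // The proposed fix: query + merge + index update all under one lock,
    // so two concurrent fetches cannot both carry the same starting index.
    static void fetchMapCompletionEvents(int totalOnJobTracker) {
        synchronized (fromEventId) {
            int[] t = getTaskCompletionEvents(fromEventId.get(), totalOnJobTracker);
            for (int e : t) allMapEvents.add(e);
            fromEventId.set(fromEventId.get() + t.length);
        }
    }

    public static void main(String[] args) throws Exception {
        // Two concurrent fetches, as in the scenario described above.
        Thread a = new Thread(() -> fetchMapCompletionEvents(10));
        Thread b = new Thread(() -> fetchMapCompletionEvents(10));
        a.start(); b.start();
        a.join(); b.join();
        // With the lock, each of the 10 events appears exactly once.
        System.out.println(allMapEvents.size()); // 10
    }
}
```

Without the synchronized block, both threads could read fromEventId == 0,
receive identical replies, and append 20 entries in total - exactly the
duplication described above.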
> Reducers hang in SHUFFLING phase due to duplicate completed tasks in
> TaskTracker.FetchStatus.allMapEvents
> ---------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4360
> URL: https://issues.apache.org/jira/browse/HADOOP-4360
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.17.2
> Reporter: Zheng Shao
>
> On our cluster we have seen the JobTracker go into a weird state in which a
> lot of TaskTrackers get duplicate entries in
> TaskTracker.FetchStatus.allMapEvents.
> Since the TaskTracker fetches newly completed map tasks using the size of
> allMapEvents as the starting index, this prevents the TaskTracker from
> getting all completed map tasks. As a result, reducers just hang in the
> SHUFFLING phase.
> The problem is not fixed by killing and restarting the TaskTracker, and when
> it happens a lot of TaskTrackers show the same problem.
> It seems something goes wrong in the communication between the JobTracker
> and the TaskTracker.
> An easy preventive fix would be to include the starting index of the list of
> completed map tasks in the reply from the JobTracker to the TaskTracker, so
> that the TaskTracker can just throw away the data if the starting index does
> not match the current size of the array.