    [ http://issues.apache.org/jira/browse/HADOOP-362?page=comments#action_12422701 ]

Devaraj Das commented on HADOOP-362:
------------------------------------

I discovered a minor problem with this patch: status updates never appeared
on the job status web page. The call to recomputeProgress for a task is
conditional on changedProgress, which is true only when progress (in the
range 0.0 - 1.0) has advanced by at least 0.01 since it was last reported
(by any tasktracker). Lowering this threshold to 0.00001 solved the problem.

> tasks can get lost when reporting task completion to the JobTracker has an
> error
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-362
>                 URL: http://issues.apache.org/jira/browse/HADOOP-362
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>         Attachments: lost-status-updates.patch
>
>
> The JobTracker used to lose some updates about successful map tasks and
> would assume that the tasks were still running (the web page kept
> displaying the old progress report). The reduces would then wait for map
> output that they would never receive, so the job would appear to be hung.
>
> The following piece of code sends the status of tasks to the JobTracker:
>
>       synchronized (this) {
>         for (Iterator it = runningTasks.values().iterator();
>              it.hasNext(); ) {
>           TaskInProgress tip = (TaskInProgress) it.next();
>           TaskStatus status = tip.createStatus();
>           taskReports.add(status);
>           if (status.getRunState() != TaskStatus.RUNNING) {
>             if (tip.getTask().isMapTask()) {
>               mapTotal--;
>             } else {
>               reduceTotal--;
>             }
>             it.remove();
>           }
>         }
>       }
>
>       //
>       // Xmit the heartbeat
>       //
>
>       TaskTrackerStatus status =
>         new TaskTrackerStatus(taskTrackerName, localHostname,
>                               httpPort, taskReports,
>                               failures);
>       int resultCode = jobClient.emitHeartbeat(status, justStarted);
>
>
> Notice that the completed TIPs are removed from the runningTasks data
> structure before the heartbeat is sent.
> Now, if emitHeartbeat throws an exception (because it could not
> communicate with the JobTracker before the IPC timeout expired), this
> update is lost. The next time the heartbeat is sent, the completed task's
> status is missing, and hence the JobTracker never learns about the
> completed task. So, one solution is to remove the completed TIPs from
> runningTasks only after emitHeartbeat returns. Here is what the new code
> would look like:
>
>
>       synchronized (this) {
>         for (Iterator it = runningTasks.values().iterator();
>              it.hasNext(); ) {
>           TaskInProgress tip = (TaskInProgress) it.next();
>           TaskStatus status = tip.createStatus();
>           taskReports.add(status);
>         }
>       }
>
>       //
>       // Xmit the heartbeat
>       //
>
>       TaskTrackerStatus status =
>         new TaskTrackerStatus(taskTrackerName, localHostname,
>                               httpPort, taskReports,
>                               failures);
>       int resultCode = jobClient.emitHeartbeat(status, justStarted);
>       synchronized (this) {
>         for (Iterator it = runningTasks.values().iterator();
>              it.hasNext(); ) {
>           TaskInProgress tip = (TaskInProgress) it.next();
>           if (tip.runstate != TaskStatus.RUNNING) {
>             if (tip.getTask().isMapTask()) {
>               mapTotal--;
>             } else {
>               reduceTotal--;
>             }
>             it.remove();
>           }
>         }
>       }
>
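
The idea in the issue description (drop completed tasks only after the
heartbeat has actually been delivered) can be sketched in isolation. This is
a minimal illustration, not the contents of lost-status-updates.patch: the
class HeartbeatSketch, the Heartbeat interface, and the boolean
"still running" flag are hypothetical stand-ins for TaskTracker,
jobClient.emitHeartbeat, and TaskStatus, and the sketch assumes a failed
delivery surfaces as an IOException.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a task tracker's bookkeeping; not Hadoop code. */
final class HeartbeatSketch {

    /** Hypothetical stand-in for the RPC that reports to the JobTracker. */
    interface Heartbeat {
        void emit(Map<String, Boolean> statuses) throws IOException;
    }

    // taskId -> true while the task is still running, false once it has finished
    private final Map<String, Boolean> runningTasks = new LinkedHashMap<>();

    synchronized void start(String taskId) { runningTasks.put(taskId, true); }

    synchronized void finish(String taskId) { runningTasks.put(taskId, false); }

    synchronized boolean isTracked(String taskId) {
        return runningTasks.containsKey(taskId);
    }

    /** Sends one heartbeat; returns true only if it was delivered. */
    boolean sendHeartbeat(Heartbeat jobTracker) {
        Map<String, Boolean> snapshot;
        synchronized (this) {
            // Copy the statuses to report; nothing is removed at this point.
            snapshot = new LinkedHashMap<>(runningTasks);
        }
        try {
            jobTracker.emit(snapshot);
        } catch (IOException e) {
            // Delivery failed: finished tasks stay tracked and are re-reported.
            return false;
        }
        synchronized (this) {
            // Forget only tasks that were reported as finished in the snapshot.
            for (Map.Entry<String, Boolean> entry : snapshot.entrySet()) {
                if (!entry.getValue()) {
                    runningTasks.remove(entry.getKey());
                }
            }
        }
        return true;
    }
}
```

In this sketch the cleanup pass walks the delivered snapshot rather than
re-reading the live state, so a task that finishes while the call is in
flight stays tracked and is reported by the next heartbeat. Whether the
actual patch handles that case the same way is not stated in the issue.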
