[ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708781#action_12708781 ]
Devaraj Das commented on HADOOP-2141:
-------------------------------------

I am going through the patch. Some early comments:

1) I don't understand the motivation for having two time fields, dispatchTime and mostRecentStartTime. Both of them are updated in the same code flow: mostRecentStartTime is updated in TaskInProgress.getTaskToRun, and dispatchTime is updated just after assignTasks in the JobTracker. But getTaskToRun is called from within assignTasks anyway, so why have two fields representing the same information?

2) The locality code seems quite redundant. The locality aspect actually conflicts with the algorithm for choosing tasks to speculate. In the current (unpatched) codebase, we get the running-tasks list based on locality w.r.t. the tracker that just came in asking for a task, and then see if something can be speculatively run. In the patch, *all* running tasks are sorted globally by progress rate and expected time to completion, and a task from that list is handed out. Locality could be no more than a coincidence here. I will ponder some more whether to leave that code around or simplify it to remove the locality aspects for running tasks.

Now, coming to Eric's concern about a slow disk slowing the progress of a task: if the speculative task also starts reading input from the same replica, then yes, there is a problem. So yes, this is an interesting area for further research!
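The global ranking described above (all running tasks sorted by progress rate and expected time to completion, ignoring locality) could be sketched roughly as follows. This is a minimal illustration, not the patch's actual code; the class and field names (RunningTask, progressRate, expectedMillisToCompletion) are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank running tasks for speculation by expected
// time to completion, estimated from each task's observed progress rate.
public class SpeculationRanker {

    static class RunningTask {
        final String id;
        final double progress;      // fraction complete, 0.0 .. 1.0
        final long runningMillis;   // wall-clock time since dispatch

        RunningTask(String id, double progress, long runningMillis) {
            this.id = id;
            this.progress = progress;
            this.runningMillis = runningMillis;
        }

        // Progress per millisecond; guards against zero elapsed time.
        double progressRate() {
            return progress / Math.max(1, runningMillis);
        }

        // Remaining work divided by the observed rate gives the estimate.
        double expectedMillisToCompletion() {
            return (1.0 - progress) / Math.max(progressRate(), 1e-12);
        }
    }

    // Sort so the task expected to finish last comes first: that task is
    // the best candidate for a speculative attempt, wherever it runs.
    static List<RunningTask> rankForSpeculation(List<RunningTask> tasks) {
        List<RunningTask> sorted = new ArrayList<>(tasks);
        sorted.sort(Comparator.comparingDouble(
                RunningTask::expectedMillisToCompletion).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<RunningTask> tasks = new ArrayList<>();
        tasks.add(new RunningTask("task_a", 0.95, 60_000));  // nearly done
        tasks.add(new RunningTask("task_b", 0.10, 60_000));  // straggler
        tasks.add(new RunningTask("task_c", 0.50, 60_000));
        // The straggler ranks first regardless of which tracker asked,
        // which is why locality can only be a coincidence in this scheme.
        System.out.println(rankForSpeculation(tasks).get(0).id);
    }
}
```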
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch, HADOOP-2141.v7.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck at 95% completion because of a bad disk.
> Devaraj pointed out:
> bq. One of the conditions that must be met for launching a speculative instance of a task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also started up when tasks stop making progress.
> Devaraj suggested:
> bq. Maybe, we should introduce a condition for average completion time for tasks in the speculative execution check.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
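The 20%-behind-average trigger quoted in the description can be sketched as follows, and the sketch shows exactly why the stuck-at-95% reducers never qualified. This is an illustration of the condition as described, not Hadoop's actual implementation; the constant and method names are made up:

```java
import java.util.List;

// Hedged sketch of the speculation trigger described in the report: a task
// qualifies only if its progress trails the average progress of its peers
// by at least 20 percentage points.
public class SpeculationCheck {

    static final double SPECULATIVE_GAP = 0.20;

    static double averageProgress(List<Double> progresses) {
        double sum = 0.0;
        for (double p : progresses) sum += p;
        return progresses.isEmpty() ? 0.0 : sum / progresses.size();
    }

    // True when this task is far enough behind the pack to speculate on.
    static boolean shouldSpeculate(double taskProgress, List<Double> all) {
        return taskProgress < averageProgress(all) - SPECULATIVE_GAP;
    }

    public static void main(String[] args) {
        // The failure mode from the bug report: every reducer sits at 95%,
        // so no task is 20% behind the 95% average, and nothing is ever
        // speculated even though four tasks have stopped making progress.
        List<Double> all = List.of(0.95, 0.95, 0.95, 0.95);
        System.out.println(shouldSpeculate(0.95, all));  // prints false
    }
}
```

This is why the suggested completion-time condition matters: a progress-gap check alone cannot detect tasks that are close to done but no longer advancing.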