[
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708781#action_12708781
]
Devaraj Das commented on HADOOP-2141:
-------------------------------------
I am going through the patch. Some early comments:
1) I don't understand the motivation for having two time fields, dispatchTime
and mostRecentStartTime. Both of them are updated in the same code flow:
mostRecentStartTime is updated in TaskInProgress.getTaskToRun, and
dispatchTime is updated just after assignTasks in JobTracker. But
getTaskToRun is anyway called from within assignTasks, so why have two fields
representing the same information?
2) The locality code seems quite redundant, actually. The locality aspect
conflicts with the algorithm for choosing tasks to speculate. In the
current (unpatched) codebase, we build the running-tasks list based on
locality w.r.t. the tracker that just came in asking for a task, and then
see if something can be speculatively run. In the patch, *all* running tasks
are sorted globally by progress rate and expected time to completion, and a
task from that list is handed out. Locality could be a coincidence here at
best. I will ponder some more whether to leave that code around or to
simplify it by removing the locality aspects for running tasks.
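As a sketch of the patch's global sort described above (class and field names here are illustrative assumptions, not the patch's actual identifiers), ordering all running tasks by expected time to completion, with no reference to the requesting tracker's locality, might look like:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a running task; the real TaskInProgress tracks
// much more state than this.
class RunningTask {
    final String id;
    final double progress;      // fraction complete, 0.0 - 1.0
    final double progressRate;  // progress per second

    RunningTask(String id, double progress, double progressRate) {
        this.id = id;
        this.progress = progress;
        this.progressRate = progressRate;
    }

    // Estimated seconds until this task finishes at its current rate.
    double expectedTimeToCompletion() {
        if (progressRate <= 0.0) {
            // A stalled task never finishes at this rate; put it first.
            return Double.POSITIVE_INFINITY;
        }
        return (1.0 - progress) / progressRate;
    }
}

class SpeculationSort {
    // Candidates for speculation, slowest-to-finish first. Note that
    // locality to the tracker asking for work plays no role here, which
    // is exactly the conflict pointed out above.
    static List<RunningTask> candidates(List<RunningTask> running) {
        List<RunningTask> sorted = new ArrayList<>(running);
        sorted.sort(Comparator.comparingDouble(
                RunningTask::expectedTimeToCompletion).reversed());
        return sorted;
    }
}
```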
Now, coming to Eric's concern about a slow disk slowing the progress of a task:
if the speculative task also starts reading its input from the same replica,
then yes, there is a problem. So this is an interesting area for further
research!
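To make the two trigger conditions concrete, here is an illustrative sketch (not the actual mapred code; the method names, the slack factor, and the stall measure are assumptions) of the current "20% behind average progress" check alongside the completion-time-based check this issue proposes:

```java
// Illustrative only: the real check lives inside JobInProgress/TaskInProgress
// and uses different names and state.
class SpeculationCheck {
    static final double PROGRESS_GAP = 0.2; // the current 20% condition

    // Current condition: speculate only if the task trails the average
    // progress of its peers by at least 20 percentage points. Four reduces
    // stuck at 95% against a ~99% average never satisfy this.
    static boolean behindAverage(double taskProgress, double avgProgress) {
        return avgProgress - taskProgress >= PROGRESS_GAP;
    }

    // Proposed additional condition: speculate if the task has reported no
    // progress for longer than some multiple (slack) of the average
    // completion time of already-finished peers. All times in milliseconds.
    static boolean stalled(long now, long lastProgressTime,
                           long avgCompletionTime, double slack) {
        return (now - lastProgressTime) > slack * avgCompletionTime;
    }
}
```

With a check like `stalled`, the hung 95%-complete reduces described below would become speculation candidates even though they are not 20% behind the average.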
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
> Key: HADOOP-2141
> URL: https://issues.apache.org/jira/browse/HADOOP-2141
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.21.0
> Reporter: Koji Noguchi
> Assignee: Andy Konwinski
> Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch,
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch,
> HADOOP-2141.patch, HADOOP-2141.v7.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative
> instance of a task is that it must be at least 20% behind the average
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop
> making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for
> tasks in the speculative execution check.