[ 
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698751#action_12698751
 ] 

Devaraj Das commented on HADOOP-2141:
-------------------------------------

Ok some more points:
1. The method TaskInProgress.canBeSpeculated should include a check for 
!skipping (as in the trunk's TaskInProgress.hasSpeculativeTask)

2. The progressRate in the TaskStatus is computed by the JobTracker. But the 
startTime in that computation is what the TaskTracker saw. This is a potential 
problem if the clocks in the cluster nodes are not synchronized. I am wondering 
whether it makes sense to have the TaskTrackers compute progress rates and let 
the JT know. Also, there is a bug in the very first status report: 
updateProgress would be called without a valid startTime, since startTime is 
read later on in TaskStatus.readFields.
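
To illustrate point 2, here is a minimal sketch of a rate computation that uses only timestamps taken from one clock (for instance the TaskTracker's own) and returns a neutral value when startTime has not been set yet. The class and method names are hypothetical and are not taken from the patch or from Hadoop.

```java
/** Hypothetical helper for illustration only; not part of the patch. */
final class ProgressRateSketch {

    /**
     * Returns progress per millisecond. Both timestamps are assumed to come
     * from the same node's clock, so cluster clock skew cannot distort the
     * result. Returns 0.0 when startTimeMs is unset (<= 0) or not earlier
     * than nowMs, which covers a first status report with no valid startTime.
     */
    static double progressRate(double progress, long startTimeMs, long nowMs) {
        if (startTimeMs <= 0 || nowMs <= startTimeMs) {
            return 0.0;
        }
        return progress / (double) (nowMs - startTimeMs);
    }
}
```

If the TaskTracker computed this value and shipped it in the status report, the JobTracker would never have to combine its own clock with a remote startTime.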

3. Why does progRateScores in isSlowTracker need to be a JobInProgress field 
(as opposed to it being a local variable)?

4. The init for the speculative thresholds can go to a separate method and the 
two JobInProgress constructors can invoke that.

5. You should change JobInProgress.scheduleMap to not create running map caches 
for each level. Instead just have one flat list for running tasks (earlier 
findSpeculativeTask would try to schedule tasks based on locality but it makes 
sense to ignore locality for speculative tasks since they will be very few). 
That means you should also change your new method getSpeculativeMap to not do a 
lookup on the cache and instead just look at this list. Furthermore, the 
retireMap method needs to be changed in a similar way. Overall, these changes 
will bring the handling of running map tasks close to how running reduces are 
handled (runningReduces is the list).
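
A rough sketch of the flat-list idea in point 5, assuming a single insertion-ordered collection is enough. The method names echo scheduleMap, retireMap and getSpeculativeMap from the patch, but the signatures, the generic type and the predicate argument are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical stand-in for the running-map bookkeeping; not the patch's code. */
final class RunningMapsSketch<T> {

    // One flat, insertion-ordered collection instead of one cache per locality level.
    private final Set<T> runningMaps = new LinkedHashSet<>();

    /** Records a map task as running. */
    void scheduleMap(T tip) {
        runningMaps.add(tip);
    }

    /** Drops a map task once it is no longer running. */
    void retireMap(T tip) {
        runningMaps.remove(tip);
    }

    /** Returns the first running task accepted by the predicate, ignoring locality, or null. */
    T getSpeculativeMap(Predicate<T> canBeSpeculated) {
        for (T tip : runningMaps) {
            if (canBeSpeculated.test(tip)) {
                return tip;
            }
        }
        return null;
    }

    int size() {
        return runningMaps.size();
    }
}
```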

6. The Collection<TaskInProgress> that isSlowTracker is passed will be only the 
running TIPs. But isSlowTracker should take into account the TIPs that ran 
successfully as well (and in the implementation of isSlowTracker, you indeed 
make the check 'state == TaskStatus.State.SUCCEEDED' ). In order to implement 
that, you probably need to maintain a global HashMap of TT -> 
progress-rate-of-completed-tasks in the JobInProgress. The HashMap should be 
updated with the progress rate whenever a task successfully completes (just like 
the updates to the entries in progRateScores for running tasks). It also means 
that you need to have a global completedTaskCount (similar to taskCount for 
running tasks).
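
The bookkeeping suggested in point 6 could look roughly like the sketch below. It assumes trackers are identified by a String name and that a per-tracker sum plus count is a sufficient summary; the class and all member names except completedTaskCount are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-job statistics for completed tasks; not the patch's code. */
final class CompletedRateSketch {

    // TT name -> sum of progress rates of tasks that succeeded on that tracker.
    private final Map<String, Double> completedRateSum = new HashMap<>();
    // TT name -> number of tasks that succeeded on that tracker.
    private final Map<String, Integer> completedPerTracker = new HashMap<>();
    // Job-wide count of succeeded tasks, the counterpart of taskCount for running tasks.
    private int completedTaskCount;

    /** To be called once whenever a task completes successfully. */
    void taskSucceeded(String tracker, double progressRate) {
        completedRateSum.merge(tracker, progressRate, Double::sum);
        completedPerTracker.merge(tracker, 1, Integer::sum);
        completedTaskCount++;
    }

    /** Mean progress rate of completed tasks on a tracker, or 0.0 if it has none. */
    double meanCompletedRate(String tracker) {
        Integer n = completedPerTracker.get(tracker);
        if (n == null) {
            return 0.0;
        }
        return completedRateSum.get(tracker) / n;
    }

    int completedTaskCount() {
        return completedTaskCount;
    }
}
```

isSlowTracker could then combine these figures with the running-task entries in progRateScores; how the two are weighted is left open here.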

7. findSpeculativeTask in the patch returns null when the list has fewer items 
than MIN_TASKS_FOR_SPECULATIVE_FILTER (which is hardcoded to 10). The corner 
case here is if a job has only that many tasks to run in total, no speculative 
tasks would be run at all. I wonder whether we should have that check at all.
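
To make the corner case in point 7 concrete, the guard as described behaves like the sketch below. Only the constant name and its value of 10 come from the patch; the class and method are hypothetical.

```java
/** Hypothetical illustration of the size guard described in point 7. */
final class SpeculativeFilterSketch {

    // Hardcoded cutoff, as in the patch.
    static final int MIN_TASKS_FOR_SPECULATIVE_FILTER = 10;

    /** True when the guard would make findSpeculativeTask return null outright. */
    static boolean guardBlocksSpeculation(int candidateListSize) {
        return candidateListSize < MIN_TASKS_FOR_SPECULATIVE_FILTER;
    }
}
```

A job whose candidate list never reaches the cutoff is blocked on every call, regardless of how slow its tasks are.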


> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, 
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, 
> HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative 
> instance of a task is that it must be at least 20% behind the average 
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop 
> making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for 
> tasks in the speculative execution check. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.