[
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Konwinski updated HADOOP-2141:
-----------------------------------
Attachment: HADOOP-2141-v5.patch
First, I found a significant bug in the current patch, in the logic of
isSlowTracker() that turns the sums over each taskTracker's tasks into
averages. The attached updated patch contains the fix.
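To illustrate the sum-vs-average distinction at issue, here is a minimal sketch (class and field names are hypothetical, not from the actual patch): comparing raw sums penalizes trackers that simply ran more tasks, so the per-tracker sum must be divided by the task count before comparing against the cluster average.

```java
public class SlowTrackerSketch {
    // Hypothetical per-tracker statistics: summed task durations and count.
    static class TrackerStats {
        double durationSum;
        int taskCount;
    }

    // Flags a tracker as slow only if its *average* task duration exceeds
    // the cluster-wide average by the given factor. The bug described above
    // was in the step that converts per-tracker sums into averages.
    static boolean isSlowTracker(TrackerStats tracker, double clusterAvg,
                                 double threshold) {
        if (tracker.taskCount == 0) {
            return false; // no data for this tracker yet
        }
        double trackerAvg = tracker.durationSum / tracker.taskCount;
        return trackerAvg > clusterAvg * threshold;
    }
}
```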
Devaraj, regarding your suggestion of removing countSpeculating() in favor of
class fields that maintain counts of running speculative map and reduce tasks:
I agree that this would likely perform better, and it is easy to increment the
counters in the right place (i.e. in getSpeculativeMap() and
getSpeculativeReduce()). However, it isn't as clear where to decrement them.
They need to be decremented whenever a speculative task is killed or
completes, and the code that manages this state transition is convoluted
because it handles a number of scenarios (failed task trackers, a speculative
attempt succeeding, a speculative attempt being killed because the original
attempt succeeded). I got a little lost digging through the code trying to
figure out where these counters would need to be decremented. There is a
comment in JobInProgress.completedTask() that says "TaskCommitThread in the
JobTracker marks other, completed, speculative tasks as _complete_," but I
can't find the TaskCommitThread it references, and I don't think that
adjusting the counts only when speculative tasks complete (as opposed to being
killed or failing) would be enough. My vote is that we put this off for now.
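To make the asymmetry concrete, here is a sketch of the proposed counter (names are illustrative, not from any patch): the increment has one call site, while the decrement must fire on every exit path of a speculative attempt, which is what makes it hard to place correctly.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpeculativeCountSketch {
    // Hypothetical counter replacing countSpeculating().
    private final AtomicInteger speculatingMaps = new AtomicInteger(0);

    // The increment side is easy: one call site, when a speculative map
    // attempt is handed out (i.e. in getSpeculativeMap()).
    void onSpeculativeMapLaunched() {
        speculatingMaps.incrementAndGet();
    }

    // The decrement side is the hard part: it must be reached on every
    // exit path of a speculative attempt, not just normal completion:
    //   1. the speculative attempt succeeds;
    //   2. it is killed because the original attempt succeeded;
    //   3. it fails, or its task tracker is lost.
    // Missing any one path leaves the counter permanently inflated.
    void onSpeculativeMapFinished() {
        speculatingMaps.decrementAndGet();
    }

    int runningSpeculativeMaps() {
        return speculatingMaps.get();
    }
}
```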
Regarding the modification to keep the sorted list of candidates around: one
potential problem I see is that if a task cached in the sorted list of tips
finishes before we recompute the list, we could end up speculating a task that
has already completed.
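One way to guard against that staleness, sketched below with hypothetical names (the real TaskInProgress has far more state): re-check completion at selection time rather than trusting the ordering computed earlier.

```java
import java.util.List;

public class CachedCandidatesSketch {
    // Hypothetical stand-in for a TaskInProgress entry.
    static class Tip {
        boolean complete;
        Tip(boolean complete) { this.complete = complete; }
    }

    // A tip cached in the sorted candidate list may have finished since
    // the list was built, so re-check completion before speculating on it
    // instead of blindly taking the head of the cached list.
    static Tip pickSpeculativeCandidate(List<Tip> cachedSorted) {
        for (Tip tip : cachedSorted) {
            if (!tip.complete) {
                return tip;
            }
        }
        return null; // every cached candidate has already finished
    }
}
```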
I have implemented your suggestion to keep a list of task trackers around, and
have set the time to 2 minutes (using the SLOW_TRACKER_SORT_DELAY constant).
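The time-based recompute can be sketched as follows; the constant's name matches the one described above, but the surrounding class and method are illustrative assumptions, not the patch itself.

```java
public class SortDelaySketch {
    // 2 minutes, per the SLOW_TRACKER_SORT_DELAY setting described above.
    static final long SLOW_TRACKER_SORT_DELAY = 2 * 60 * 1000L;

    private long lastSortTime = 0;

    // Recompute the sorted tracker list only once the delay has elapsed;
    // otherwise the cached ordering is reused, bounding the sorting cost.
    boolean shouldResort(long now) {
        if (now - lastSortTime >= SLOW_TRACKER_SORT_DELAY) {
            lastSortTime = now;
            return true;
        }
        return false;
    }
}
```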
One thing I think is important is to test the effects of this patch on
MapReduce performance, since much of the code base has changed and this patch
is quite different from the one we used for the experiments in the OSDI
paper.
Finally, Devaraj, I wanted to double-check that you didn't add any new
functionality or bug fixes in your patch, but that it was just a merge with
trunk (plus moving the default values for the parameters from
hadoop-default.xml to mapred-default.xml). In particular, I noticed some
properties that your patch adds to mapred-default.xml that don't seem to be
related to this JIRA or used in the rest of the patch (e.g.
mapred.shuffle.maxFetchPerHost). Were these included intentionally?
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
> Key: HADOOP-2141
> URL: https://issues.apache.org/jira/browse/HADOOP-2141
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Koji Noguchi
> Assignee: Andy Konwinski
> Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch,
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative
> instance of a task is that it must be at least 20% behind the average
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop
> making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for
> tasks in the speculative execution check.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.