[
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703048#action_12703048
]
Andy Konwinski commented on HADOOP-2141:
----------------------------------------
Hi Devaraj, thanks for the code review. I have many of your comments
implemented already and am still working on the more significant ones; I should
have a new patch ready in the next 2 to 3 days. Before then, though, I wanted
to post an update on my progress and respond to some of your suggestions to
allow for wider input. Patch to come soon!
1. Done
2. The problem with this is that we want to be able to identify laggard tasks
even when they are not reporting progress. That is, if we don't get a
TaskStatus update for a task from the TT (perhaps because the TT is down, or
the task is hanging), we want the task to appear slower as time goes on from
the JT's perspective.
3. Currently it is a field because I am only recalculating the task tracker
ordering every SLOW_TRACKER_SORT_DELAY minutes (currently set to 2 min), so we
have to keep the progress rate scores around between sorts. However, since I'm
rewriting isSlowTracker() anyway (see 6 below), this is no longer relevant.
4. Done
5. Done
6. I've spoken with Matei about this, and we've decided that using the mean and
variance (i.e., checking whether the average ProgressRate of tasks that
finished on the tracker is more than one standard deviation below the average
ProgressRate of tasks on other trackers) to determine whether a TaskTracker is
slow is much better than using a percentile. The current plan is to create a
new class, DataStatistics, used to track statistics for a set of numbers (by
storing the count, sum, and sum of squares). DataStatistics will provide mean()
and std() methods. The object will be used at two levels:
* a field of JobInProgress, taskStats, for tracking stats of all tasks
* a map field of JobInProgress, trackerStats, keyed by TaskTracker name, with
values of type DataStatistics
Updating the statistics data structures above will be a constant-time operation
done when TaskTrackers report tasks as complete.
All of this makes isSlowTracker() really simple. Basically it consists of:
if (trackerStats.get(taskTracker).mean() < taskStats.mean() - taskStats.std()) {
  return true;
}
7. Let's use a percentage instead (10%?)
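To make item 6 above concrete, here is a minimal sketch of what the proposed DataStatistics class could look like. The class and method names (DataStatistics, mean(), std()) come from the comment itself, but the fields and the add() method are my assumptions about how the constant-time update could work, not the actual patch.

```java
// Sketch of the DataStatistics class described in item 6. Fields
// (count, sum, sumSquares) and add() are assumptions, not the patch.
public class DataStatistics {
    private int count = 0;
    private double sum = 0.0;
    private double sumSquares = 0.0;

    // Constant-time update, invoked when a TaskTracker reports a
    // task as complete.
    public synchronized void add(double number) {
        count++;
        sum += number;
        sumSquares += number * number;
    }

    public synchronized double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    public synchronized double std() {
        if (count == 0) {
            return 0.0;
        }
        // Population std dev: sqrt(E[x^2] - E[x]^2); clamp at zero to
        // guard against floating-point rounding going slightly negative.
        double m = mean();
        return Math.sqrt(Math.max(0.0, sumSquares / count - m * m));
    }
}
```

With one DataStatistics per tracker plus one job-wide instance, the isSlowTracker() comparison above is just two field reads and a subtraction.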
-------
One other comment: while discussing 2 above with Devaraj and Matei, we
concluded that it is important to more closely consider the mechanism used to
calculate a task's progress rate. The mechanism we're using in the patch so far
(i.e., a task's progress / (currentTime - startTime), which can be seen in
TaskStatus.updateProgressRate) might be improved by looking more closely at how
to normalize the amount of time the task has been running by the amount of data
it has processed (potentially phase-wise). When Matei and I wrote the original
LATE paper, we didn't dig very deep into the task progress reporting
mechanisms, but rather just used the progress as it was reported, while making
note of some of the oddities regarding the three phases. I am still trying to
validate for myself how closely the progress reported by tasks to the
TaskTracker reflects the amount of data processed so far. Pending a deeper look
into this, it might be advantageous to revisit the progressRate mechanism after
we commit a simple version of the patch that uses progressRate as is (assuming
that testing at scale shows performance improvements).
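As a small illustration of the point in 2 and the formula above: because the rate is progress divided by elapsed time, a hung task's rate keeps shrinking even when no new TaskStatus arrives. The helper below is hypothetical and only mirrors the formula; it is not the real TaskStatus API.

```java
public class ProgressRateDemo {
    // Hypothetical helper mirroring progress / (currentTime - startTime)
    // as discussed above; not the actual Hadoop TaskStatus method.
    static double progressRate(float progress, long startTimeMs, long nowMs) {
        return progress / (double) (nowMs - startTimeMs);
    }

    public static void main(String[] args) {
        // A task stuck at 50% looks slower and slower as wall-clock time
        // advances, even with no new status reports from the TT.
        System.out.println(progressRate(0.5f, 0L, 50_000L));  // rate at t = 50s
        System.out.println(progressRate(0.5f, 0L, 100_000L)); // lower rate at t = 100s
    }
}
```

This is exactly the behavior wanted in item 2: from the JT's perspective, silence makes a task's computed rate fall, which eventually marks it as a laggard.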
Again, the patch will be up in the next few days.
Andy
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
> Key: HADOOP-2141
> URL: https://issues.apache.org/jira/browse/HADOOP-2141
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.21.0
> Reporter: Koji Noguchi
> Assignee: Andy Konwinski
> Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch,
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch,
> HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative
> instance of a task is that it must be at least 20% behind the average
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop
> making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for
> tasks in the speculative execution check.