[
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703048#action_12703048
]
Andy Konwinski commented on HADOOP-2141:
----------------------------------------
Hi Devaraj, thanks for the code review. I have many of your comments
implemented already and am still working on the more significant ones; I should
have a new patch ready in the next 2 to 3 days. Before then, though, I wanted
to post an update on my progress and respond to some of your suggestions to
allow for wider input. Patch to come soon!
1. Done
2. The problem with this is that we want to be able to identify laggard tasks
even when they are not reporting progress. That is, if we don't get a
TaskStatus update for a task from the TT (perhaps because the TT is down, or
the task is hanging), we want the task to appear slower as time goes on from
the JT's perspective.
3. Currently it is a field because I am only recalculating the task tracker
ordering every SLOW_TRACKER_SORT_DELAY minutes (currently set to 2 min), so we
have to keep the progress rate scores around between sorts. However, since I'm
rewriting isSlowTracker() anyway (see 6 below), this is no longer relevant.
4. Done
5. Done
6. I've spoken with Matei about this, and we've decided that using the mean and
variance (i.e., checking whether the average ProgressRate of tasks that
finished on the tracker is more than one standard deviation below the average
ProgressRate of tasks on other trackers) to determine whether a TaskTracker is
slow is much better than using a percentile. The current plan is to create a
new class, DataStatistics, used to track statistics for a set of numbers (by
storing the count, sum, and sum of squares). DataStatistics will provide mean()
and std() methods. The object will be used at two levels:
* a field of JobInProgress, taskStats, for tracking stats of all tasks
* a map field of JobInProgress, trackerStats, keyed by TaskTracker name, with
values of type DataStatistics
Updating the statistics data structures above will be a constant-time operation
done when TaskTrackers report tasks as complete.
All of this makes isSlowTracker() really simple. Basically it consists of:
if (trackerStats.get(taskTracker).mean() < taskStats.mean() - taskStats.std()) {
  return true;
}
7. Let's use a percentage instead (10%?)
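To make item 6 above concrete, here is a minimal sketch of what the proposed DataStatistics class could look like. The class and method names (DataStatistics, mean(), std()) come from the comment itself, but the fields and the add() method are my assumptions about how the constant-time update could work, not the actual patch.

```java
// Sketch of the DataStatistics class described in item 6. Fields
// (count, sum, sumSquares) and add() are assumptions, not the patch.
public class DataStatistics {
    private int count = 0;
    private double sum = 0.0;
    private double sumSquares = 0.0;

    // Constant-time update, invoked when a TaskTracker reports a
    // task as complete.
    public synchronized void add(double number) {
        count++;
        sum += number;
        sumSquares += number * number;
    }

    public synchronized double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    public synchronized double std() {
        if (count == 0) {
            return 0.0;
        }
        // Population std dev: sqrt(E[x^2] - E[x]^2); clamp at zero to
        // guard against floating-point rounding going slightly negative.
        double m = mean();
        return Math.sqrt(Math.max(0.0, sumSquares / count - m * m));
    }
}
```

With one DataStatistics per tracker plus one job-wide instance, the isSlowTracker() comparison above is just two field reads and a subtraction.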
-------
One other comment: while discussing 2 above with Devaraj and Matei, we
concluded that it is important to more closely consider the mechanism used to
calculate a task's progress rate. The mechanism we're using in the patch so far
(i.e., a task's progress / (currentTime - startTime), which can be seen in
TaskStatus.updateProgressRate) might be improved by looking more closely at how
to normalize the amount of time the task has been running by the amount of data
it has processed (potentially phase-wise). When Matei and I wrote the original
LATE paper, we didn't dig very deep into the task progress reporting
mechanisms, but rather just used the progress as it was reported, while making
note of some of the oddities regarding the three phases. I am still trying to
validate for myself how closely the progress reported by tasks to the
TaskTracker reflects the amount of data processed so far. Pending a deeper look
into this, it might be advantageous to revisit the progressRate mechanism after
we commit a simple version of the patch that uses progressRate as is (assuming
that testing at scale shows performance improvements).
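As a small illustration of the point in 2 and the formula above: because the rate is progress divided by elapsed time, a hung task's rate keeps shrinking even when no new TaskStatus arrives. The helper below is hypothetical and only mirrors the formula; it is not the real TaskStatus API.

```java
public class ProgressRateDemo {
    // Hypothetical helper mirroring progress / (currentTime - startTime)
    // as discussed above; not the actual Hadoop TaskStatus method.
    static double progressRate(float progress, long startTimeMs, long nowMs) {
        return progress / (double) (nowMs - startTimeMs);
    }

    public static void main(String[] args) {
        // A task stuck at 50% looks slower and slower as wall-clock time
        // advances, even with no new status reports from the TT.
        System.out.println(progressRate(0.5f, 0L, 50_000L));  // rate at t = 50s
        System.out.println(progressRate(0.5f, 0L, 100_000L)); // lower rate at t = 100s
    }
}
```

This is exactly the behavior wanted in item 2: from the JT's perspective, silence makes a task's computed rate fall, which eventually marks it as a laggard.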
Again, the patch will be up in the next few days.
Andy
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
> Key: HADOOP-2141
> URL: https://issues.apache.org/jira/browse/HADOOP-2141
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.21.0
> Reporter: Koji Noguchi
> Assignee: Andy Konwinski
> Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch,
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch,
> HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative
> instance of a task is that it must be at least 20% behind the average
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop
> making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for
> tasks in the speculative execution check.