[
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697724#action_12697724
]
Devaraj Das commented on HADOOP-2141:
-------------------------------------
Andy, the current patch doesn't apply unless the fuzz factor is set to 3:
"patch -p0 -F 3 < HADOOP-2141-v6.patch". There is an NPE in the heartbeat
method, which you can reproduce by running the test TestMiniMRDFSSort: "ant
-Dtestcase=TestMiniMRDFSSort test -Dtest.output=yes". The test never finishes
because the TTs keep resending the heartbeat forever. The NPE comes from the
isSlowTracker method. Looking more closely at isSlowTracker, I think it
requires some rework. It currently looks at the progress rates of only the
running TIPs (you do check for TaskStatus.State.SUCCEEDED, but that is always
false for RUNNING TIPs, and those are what is passed to the method) and
attributes them to the TaskTrackers that are running them. But wouldn't you
want to look at the history, i.e., the successful TIPs that ran on the
TaskTrackers?
I am thinking that it would make sense to give one credit to a TT upon running
a task successfully and base isSlowTracker purely on that (rather than on the
running tasks). That way, even the TT's progress can be maintained inline and
you wouldn't have to iterate over the running TIPs and compute it upon a TT
heartbeat. Thoughts?
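To make the credit idea concrete, here is a rough sketch of what such
bookkeeping could look like. It is not taken from any HADOOP-2141 patch: the
class name TrackerCredits, the methods addTracker and taskSucceeded, and the
slowFraction threshold are all hypothetical, and the rule "slow means fewer
credits than some fraction of the mean" is only one possible assumption about
how credits might be compared. It also uses newer Java idioms than the
codebase targets.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; names and the slowness rule are assumptions. */
class TrackerCredits {
    // One credit per task that completed successfully on a tracker.
    private final Map<String, Integer> credits = new HashMap<>();
    private long totalCredits = 0;
    // Assumed threshold: slow if credits < slowFraction * mean credits.
    private final double slowFraction;

    TrackerCredits(double slowFraction) {
        this.slowFraction = slowFraction;
    }

    /** Register a tracker so it counts toward the mean even with zero credits. */
    void addTracker(String trackerName) {
        credits.putIfAbsent(trackerName, 0);
    }

    /** Maintained inline: called once when a task succeeds on the tracker. */
    void taskSucceeded(String trackerName) {
        credits.merge(trackerName, 1, Integer::sum);
        totalCredits++;
    }

    /** Constant-time check; no iteration over running TIPs at heartbeat time. */
    boolean isSlowTracker(String trackerName) {
        Integer c = credits.get(trackerName);
        if (c == null) {
            // No history for this tracker: do not guess, and do not throw an NPE.
            return false;
        }
        double mean = (double) totalCredits / credits.size();
        return c < slowFraction * mean;
    }
}
```

With this shape, the heartbeat path would only read two counters per tracker,
and a tracker with no recorded history is treated as not slow rather than
causing an exception.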
> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
> Key: HADOOP-2141
> URL: https://issues.apache.org/jira/browse/HADOOP-2141
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.21.0
> Reporter: Koji Noguchi
> Assignee: Andy Konwinski
> Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch,
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch,
> HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative
> instance of a task is that it must be at least 20% behind the average
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop
> making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for
> tasks in the speculative execution check.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.