[ 
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697724#action_12697724
 ] 

Devaraj Das commented on HADOOP-2141:
-------------------------------------

Andy, the current patch doesn't apply unless the fuzz factor is set to 3 - 
"patch -p0 -F 3 < HADOOP-2141-v6.patch". There is an NPE in the heartbeat 
method, and you can reproduce it by running the test TestMiniMRDFSSort - "ant 
-Dtestcase=TestMiniMRDFSSort test -Dtest.output=yes"; the test never finishes 
since the TTs keep resending the heartbeat forever. The NPE comes from the 
isSlowTracker method. Looking more closely at isSlowTracker, I think it 
requires some rework. The method currently looks at the progress rates of only 
the running TIPs (you do check for TaskStatus.State.SUCCEEDED, but that is 
always false for the RUNNING TIPs that are passed to the method) and attaches 
those rates to the TaskTrackers that are running them. But wouldn't you want to 
look at the history, i.e., the successful TIPs that ran on the TaskTrackers? 
I am thinking it would make sense to give a TT one credit for each task it 
runs successfully and base isSlowTracker purely on that (rather than on the 
running tasks). That way, even the TT's progress can be maintained inline, and 
you wouldn't have to iterate over the running TIPs and recompute it on every 
TT heartbeat. Thoughts?
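To make the credit idea concrete, here is a minimal sketch (class, method, and 
threshold are all illustrative assumptions, not from the patch): each 
TaskTracker gains one credit per successfully completed task, and a tracker is 
flagged as slow when its credits fall well below the mean across trackers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the credit-based check described above.
public class SlowTrackerCredits {
  private final Map<String, Integer> successCredits = new HashMap<>();

  // Called whenever a task attempt on this tracker succeeds.
  public void creditSuccess(String trackerName) {
    successCredits.merge(trackerName, 1, Integer::sum);
  }

  // A tracker is "slow" if its credits are below half the mean over all
  // trackers seen so far; the 0.5 threshold is an assumption for illustration.
  public boolean isSlowTracker(String trackerName) {
    if (successCredits.isEmpty()) {
      return false;  // no history yet, nothing to compare against
    }
    double total = 0;
    for (int c : successCredits.values()) {
      total += c;
    }
    double mean = total / successCredits.size();
    int credits = successCredits.getOrDefault(trackerName, 0);
    return credits < 0.5 * mean;
  }
}
```

Since the credits are updated incrementally on task completion, the heartbeat 
path only does an O(#trackers) average, with no iteration over running TIPs.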

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, 
> HADOOP-2141-v4.patch, HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, 
> HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative 
> instance of a task is that it must be at least 20% behind the average 
> progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop 
> making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for 
> tasks in the speculative execution check. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
