[jira] Commented: (HADOOP-4716) testRestartWithLostTracker frequently times out

Amar Kamat (JIRA) Mon, 01 Dec 2008 20:30:09 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652267#action_12652267 ]


Amar Kamat commented on HADOOP-4716:
------------------------------------

Also, the reducers build up a list of known map outputs from the map completion 
events obtained from the task-tracker. Upon JobTracker restart (HADOOP-3245), 
the reducer doesn't clear this list. The problem arises (on a very small 
cluster, e.g. in test cases) when a map completes, its completion event is 
still in the buffer and not yet flushed to the history file, and the tracker on 
which it ran gets lost. In such a case the (new) JobTracker has no idea about 
the map, and since the reducer's list is stale, it takes the reducer some time 
to figure out that the map location is bad and to use the other one 
(re-executed by the _new_ JobTracker). Note that on a large cluster this should 
not be an issue since, upon failure, new maps from different hosts will be 
tried, and after some time the new location for such (dangling) maps will be 
passed on by the JobTracker. We have 2 choices:
- stale data : Keeping the list can help in cases where a tracker has not yet 
rejoined but the data (the map's output) is still valid/available. The drawback 
is the case where a tracker is lost and its data becomes unavailable. In such a 
case the location will be retried again and again until the (newly re-executed) 
map's output is pulled from some other tracker. Here time is wasted in pulling 
map output from a lost tracker/node and in waiting for the (dangling) maps 
(from that node) to be re-executed.
- fresh data : Clearing the list can help in cases where a few trackers go 
down. The drawback is that trackers that are up and ready to serve the map 
output will be ignored since they are yet to join. Here time is wasted in 
waiting for the tracker to _formally_ join back.
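
To make the two choices concrete, here is a rough sketch of the reducer-side 
list under each policy. This is not the actual Hadoop code: the class 
KnownMapOutputs, the RestartPolicy enum and all method names are hypothetical 
and only illustrate the trade-off described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the reducer's list of known map outputs.
// None of these names come from the Hadoop code base.
class KnownMapOutputs {

    // The two choices discussed above.
    public enum RestartPolicy { KEEP_STALE, START_FRESH }

    // map task id -> host that is believed to serve its output
    private final Map<String, String> locations = new LinkedHashMap<String, String>();

    // Record a map completion event obtained from the task-tracker.
    public void addCompletionEvent(String mapId, String host) {
        locations.put(mapId, host);
    }

    // Called when the reducer learns that the JobTracker has restarted.
    public void onJobTrackerRestart(RestartPolicy policy) {
        if (policy == RestartPolicy.START_FRESH) {
            // fresh data: forget everything and wait for trackers to rejoin
            locations.clear();
        }
        // stale data: keep the old entries, some of which may point at lost trackers
    }

    // Returns the known host for a map, or null if no location is known.
    public String locationOf(String mapId) {
        return locations.get(mapId);
    }

    public int size() {
        return locations.size();
    }

    public static void main(String[] args) {
        KnownMapOutputs outputs = new KnownMapOutputs();
        outputs.addCompletionEvent("map_0", "hostA"); // hostA is later lost
        outputs.addCompletionEvent("map_1", "hostB"); // hostB stays up

        outputs.onJobTrackerRestart(RestartPolicy.KEEP_STALE);
        // The reducer still tries hostA for map_0 until the re-executed
        // map's location arrives, but can fetch map_1 from hostB at once.
        System.out.println("stale: map_0 -> " + outputs.locationOf("map_0"));

        outputs.onJobTrackerRestart(RestartPolicy.START_FRESH);
        // Nothing is known now, not even hostB, until new events arrive.
        System.out.println("fresh: known outputs = " + outputs.size());
    }
}
```

With KEEP_STALE the reducer pays for retries against the lost host; with 
START_FRESH it pays for waiting on hosts that were serving valid output all 
along.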

> testRestartWithLostTracker frequently times out
> -----------------------------------------------
>
>                 Key: HADOOP-4716
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4716
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Johan Oskarsson
>            Assignee: Amar Kamat
>            Priority: Minor
>             Fix For: 0.20.0
>
>         Attachments: log.txt
>
>
> This test frequently times out: 
> org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker.testRestartWithLostTracker
> Example: 
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3637/testReport/org.apache.hadoop.mapred/TestJobTrackerRestartWithLostTracker/testRestartWithLostTracker/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
