[
https://issues.apache.org/jira/browse/HADOOP-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652267#action_12652267
]
Amar Kamat commented on HADOOP-4716:
------------------------------------
Also, the reducers build up a list of known map outputs from the map completion
events obtained from the task-tracker. Upon JobTracker restart (HADOOP-3245),
the reducer doesn't clear this list. The problem arises (on a very small
cluster, e.g. in test cases) when a map completes, its completion event is still
in the buffer and not yet flushed to the history file, and the tracker on which
it ran gets lost. In such a case the (new) JobTracker has no idea about the map,
and since the reducer's list is stale, it takes the reducer some time to figure
out that the map location is bad and to use the other one (re-executed by the
_new_ JobTracker). Note that on a large cluster this should not be an issue
since, upon failure, new maps from different hosts will be tried, and after some
time the new location for such (dangling) maps will be passed on by the
JobTracker. We have 2 choices:
- stale data : This can help in cases where a tracker has not yet joined but
the data (map's output) is still valid/available. The drawback is the case
where a tracker is lost and its data becomes unavailable. In such a case the
lost location will be retried again and again until the (newly re-executed)
map's output is pulled from some other tracker. Here time is wasted in
pulling map output from a lost tracker/node and in waiting for the (dangling)
maps (from that node) to be re-executed.
- fresh data : This can help in cases where a few trackers go down. The drawback
is that trackers that are up and ready to serve the map output will be
ignored since they have yet to join. Here time is wasted in waiting for
the tracker to _formally_ join back.
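
To make the two choices concrete, here is a minimal Java sketch. It is not
Hadoop's actual reducer code: the class MapOutputLocations, the RestartPolicy
enum and all method names are hypothetical, and the model assumes a reducer
tries locations in the order their completion events arrived.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a reducer's list of known map outputs (not a Hadoop class). */
final class MapOutputLocations {

    /** What the reducer does with its list when the JobTracker restarts. */
    enum RestartPolicy { KEEP_STALE, CLEAR_FRESH }

    // map task id -> trackers believed to hold that map's output, in arrival order
    private final Map<String, List<String>> locations = new HashMap<>();
    private final RestartPolicy policy;

    MapOutputLocations(RestartPolicy policy) {
        this.policy = policy;
    }

    /** Record a map completion event obtained from the task-tracker. */
    void addCompletionEvent(String mapId, String tracker) {
        locations.computeIfAbsent(mapId, k -> new ArrayList<>()).add(tracker);
    }

    /** Called when the reducer learns that the JobTracker has restarted. */
    void onJobTrackerRestart() {
        if (policy == RestartPolicy.CLEAR_FRESH) {
            // fresh data: forget everything and wait for trackers to rejoin
            locations.clear();
        }
        // stale data: keep the old entries, including any on lost trackers
    }

    /** Trackers to try for a map, oldest first; empty means "wait for events". */
    List<String> candidates(String mapId) {
        return new ArrayList<>(locations.getOrDefault(mapId, new ArrayList<>()));
    }

    public static void main(String[] args) {
        for (RestartPolicy p : RestartPolicy.values()) {
            MapOutputLocations reducer = new MapOutputLocations(p);
            reducer.addCompletionEvent("map_0", "lostTracker");
            reducer.onJobTrackerRestart();
            // the new JobTracker re-executes the dangling map elsewhere
            reducer.addCompletionEvent("map_0", "newTracker");
            System.out.println(p + " -> " + reducer.candidates("map_0"));
        }
    }
}
```

With KEEP_STALE the lost tracker stays at the head of the list and is retried
before the re-executed output is used; with CLEAR_FRESH nothing can be fetched
until completion events arrive again, even from trackers that are still up.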
> testRestartWithLostTracker frequently times out
> -----------------------------------------------
>
> Key: HADOOP-4716
> URL: https://issues.apache.org/jira/browse/HADOOP-4716
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Johan Oskarsson
> Assignee: Amar Kamat
> Priority: Minor
> Fix For: 0.20.0
>
> Attachments: log.txt
>
>
> This test frequently times out:
> org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker.testRestartWithLostTracker
> Example:
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3637/testReport/org.apache.hadoop.mapred/TestJobTrackerRestartWithLostTracker/testRestartWithLostTracker/