[
https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582325#action_12582325
]
Amar Kamat commented on HADOOP-2175:
------------------------------------
The only concern is the case where all the map outputs that are yet to be
fetched are on the same blacklisted tracker. Each reducer fetches only one map
output per host at a time, so killing all the maps will take
{{5min * num-maps-on-tracker / num-reducers}} in the best case and
{{5min * num-maps-on-tracker}} in the worst case, assuming the default config.
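To make the estimate above concrete, here is a rough sketch of the arithmetic.
The function and parameter names are made up for illustration and are not
Hadoop APIs; the 5 minute figure is simply the per-map cost quoted above for
the default config.

```python
def estimate_kill_minutes(num_maps_on_tracker, num_reducers, minutes_per_map=5):
    """Return (best_case, worst_case) minutes needed to kill all the maps
    on a single tracker, using the estimate from the comment above.

    All names here are hypothetical; minutes_per_map=5 is the assumed
    per-map cost under the default config.
    """
    if num_maps_on_tracker < 0 or num_reducers <= 0:
        raise ValueError("need num_maps_on_tracker >= 0 and num_reducers > 0")
    # Worst case: the maps get killed strictly one after another.
    worst = minutes_per_map * num_maps_on_tracker
    # Best case: the reducers complain about different maps in parallel.
    best = worst / num_reducers
    return best, worst


best, worst = estimate_kill_minutes(num_maps_on_tracker=20, num_reducers=4)
print(f"best case: {best} min, worst case: {worst} min")
```

With 20 maps on the tracker and 4 reducers this gives 25 minutes in the best
case and 100 minutes in the worst case.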
Some possible tweaks:
1) Keep track of the total failures registered against the tracker (per job)
and kill all the maps for a job if the total failures for that job exceed 25%.
2) Keep a per-job set of the unique hosts that have registered failures against
a blacklisted tracker, and kill all the maps for a job once all the reducers
have complained about the blacklisted tracker.
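A minimal sketch of the bookkeeping the two tweaks would need, per job and per
blacklisted tracker. This is not the patch; the class and method names are
hypothetical. What the 25% is measured against is not specified above, so the
denominator is left as an explicit argument here.

```python
class TrackerFetchFailures:
    """Per-job fetch-failure bookkeeping for one blacklisted tracker.

    Hypothetical helper for illustration only, not a Hadoop class.
    """

    def __init__(self, failure_threshold=0.25):
        self.failure_threshold = failure_threshold  # the 25% from tweak 1
        self.total_failures = 0                     # tweak 1: running count
        self.complaining_hosts = set()              # tweak 2: unique hosts

    def record_failure(self, reducer_host):
        """Register one fetch failure reported from reducer_host."""
        self.total_failures += 1
        self.complaining_hosts.add(reducer_host)

    def exceeds_failure_ratio(self, denominator):
        """Tweak 1: True if failures / denominator is above the threshold.

        The denominator is an assumption left to the caller.
        """
        if denominator <= 0:
            return False
        return self.total_failures / denominator > self.failure_threshold

    def all_reducers_complained(self, reducer_hosts):
        """Tweak 2: True once every reducer host has reported a failure."""
        hosts = set(reducer_hosts)
        return bool(hosts) and hosts <= self.complaining_hosts


failures = TrackerFetchFailures()
failures.record_failure("host-a")
failures.record_failure("host-b")
print(failures.exceeds_failure_ratio(denominator=4))
print(failures.all_reducers_complained(["host-a", "host-b", "host-c"]))
```

Either check returning true would be the trigger to kill all the maps for the
job; keeping both in one object means a single failure report updates the
count and the host set together.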
Currently we do something similar to kill a map based on fetch failures. We
should do the same for trackers, i.e. re-schedule all the maps (per job, maybe)
when the tracker is blacklisted. In the future we may relax the condition that
the tracker be blacklisted. Thoughts?
> Blacklisted hosts may not be able to serve map outputs
> ------------------------------------------------------
>
> Key: HADOOP-2175
> URL: https://issues.apache.org/jira/browse/HADOOP-2175
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Fix For: 0.17.0
>
> Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch,
> HADOOP-2175-v2.patch, HADOOP-2175-v2.patch
>
>
> After a node fails 4 mappers (tasks), it is added to the blacklist and will
> no longer accept tasks.
> But it will continue to serve the map outputs of any mappers that ran
> successfully there.
> However, the node may not be able to serve the map outputs either.
> This will cause the reducers to mark the corresponding map outputs as coming
> from slow hosts,
> but continue to try to get the map outputs from that node.
> This may lead to waiting forever.