[
https://issues.apache.org/jira/browse/HADOOP-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550320
]
devaraj edited comment on HADOOP-2247 at 12/10/07 9:55 PM:
---------------------------------------------------------------
I think the max backoff should be set to a high value for apps where a high
load on the cluster is expected. Apart from that, I think the decision whether
to send a notification to the JT about a map should be based on the ratio of
the number of failed fetch attempts to the total number of attempts. The higher
that ratio, the lower the probability that the map itself is faulty; it is then
much more likely that the reducer is faulty and/or the cluster is too busy
(a rough sketch of such a check follows below).
was (Author: devaraj):
I think the max backoff should be set to a high value for apps where a high
load on the cluster is expected. Apart from that, I think the decision whether
to kill a map should be based on just the ratio of failed fetches to the total
number of fetches, together with the count of failures.
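
For illustration, here is a rough sketch in plain Java of the reducer-side
check described in the edited comment above: only report a fetch failure to the
JT when the reducer's own overall failure ratio is low, i.e. when the failures
look map-specific rather than a symptom of a faulty reducer or a busy cluster.
The class and member names and the single-threshold design are assumptions made
for this sketch, not the actual ReduceTask or JobTracker code.

{code:java}
// Hypothetical sketch: decide on the reducer side whether a failed fetch
// should be blamed on the map (and reported to the JT) or treated as a
// symptom of this reducer / overall cluster load.
public class FetchFailureReportPolicy {
    private long totalFetchAttempts;
    private long failedFetchAttempts;

    // Assumed threshold: above this overall failure ratio, the reducer itself
    // (or cluster load) is the more likely culprit, so the map is not blamed.
    private final double reducerSuspectRatio;

    public FetchFailureReportPolicy(double reducerSuspectRatio) {
        this.reducerSuspectRatio = reducerSuspectRatio;
    }

    /** Record the outcome of one map-output fetch attempt by this reducer. */
    public void recordAttempt(boolean failed) {
        totalFetchAttempts++;
        if (failed) {
            failedFetchAttempts++;
        }
    }

    /** True if a failed fetch should be reported to the JT as the map's fault. */
    public boolean shouldNotifyJobTracker() {
        if (totalFetchAttempts == 0) {
            return false;
        }
        double overallFailureRatio =
            (double) failedFetchAttempts / totalFetchAttempts;
        // A high overall ratio points at the reducer or the cluster rather
        // than any particular map, so stay quiet and keep retrying instead.
        return overallFailureRatio < reducerSuspectRatio;
    }
}
{code}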
> Mappers fail easily due to repeated failures
> --------------------------------------------
>
> Key: HADOOP-2247
> URL: https://issues.apache.org/jira/browse/HADOOP-2247
> Project: Hadoop
> Issue Type: Bug
> Affects Versions: 0.15.0
> Environment: 1400 Node hadoop cluster
> Reporter: Srikanth Kakani
> Priority: Blocker
> Fix For: 0.15.2
>
>
> Related to HADOOP-2220, problem introduced in HADOOP-1158
> At this scale, hardcoding the number of fetch failures to a static number (in
> this case 3) is never going to work. The jobs we are running do load the
> systems, and under that load 3 failures can randomly occur within the lifetime
> of a map. Even fetching the data can cause enough load for that many failures
> to occur.
> We believe that the number of tasks and the size of the cluster should be
> taken into account; based on these, the ratio between total fetch attempts and
> total failed attempts should be taken into consideration.
> Given our experience, a task should be declared "Too many fetch failures"
> based on:
> failures > n /* could be 3 */ && (failures / total attempts) > k% /* could be
> 30-40% */
> Basically, the first factor gives some headstart to the second; the second
> factor then takes the cluster size and the task size into account.
> Additionally, we could take recency into account, say failures and attempts in
> the last hour. We do not want to make that window too small.
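
For illustration, a rough Java sketch of the condition proposed in the
description above: declare a map lost only when the failure count clears an
absolute floor and the failure ratio clears a percentage. The class name,
method signature, and example thresholds are assumptions for this sketch, not
the patch that ultimately resolved HADOOP-2247.

{code:java}
// Hypothetical sketch of the proposed "Too many fetch failures" condition:
//   failures > n  &&  (failures / total attempts) > k%
public class TooManyFetchFailuresPolicy {
    private final int minFailures;      // "n" in the description, e.g. 3
    private final double failureRatio;  // "k%" in the description, e.g. 0.3-0.4

    public TooManyFetchFailuresPolicy(int minFailures, double failureRatio) {
        this.minFailures = minFailures;
        this.failureRatio = failureRatio;
    }

    /**
     * Decide whether a map should be declared "Too many fetch failures".
     * Recency could be layered on top by passing counts restricted to a
     * recent window (say, the last hour) instead of lifetime totals.
     */
    public boolean shouldDeclareFailed(int failedFetches, int totalFetches) {
        if (totalFetches == 0) {
            return false;
        }
        double ratio = (double) failedFetches / totalFetches;
        // The absolute floor gives the ratio test a headstart; the ratio then
        // scales the decision with cluster size and task size.
        return failedFetches > minFailures && ratio > failureRatio;
    }
}
{code}

With n = 3 and k = 30%, for example, a map that has failed 4 of 20 fetches
(20%) would not be declared lost, while one that has failed 4 of 10 (40%)
would.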