Mappers fail easily due to repeated failures
--------------------------------------------

                 Key: HADOOP-2247
                 URL: https://issues.apache.org/jira/browse/HADOOP-2247
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.15.0
         Environment: 1400 Node hadoop cluster
            Reporter: Srikanth Kakani


Related to HADOOP-2220, problem introduced in HADOOP-1158

At this scale, hardcoding the number of fetch failures to a static value (3 in 
this case) is never going to work. Although the jobs we are running do load the 
systems, 3 failures can easily occur at random within the lifetime of a map. 
Even fetching the data can generate enough load for that many failures to occur.

We believe the number of tasks and the size of the cluster should be taken into 
account, and therefore that the ratio of failed fetch attempts to total fetch 
attempts should be considered.

Given our experience, a task should be declared to have "Too many fetch 
failures" based on:

failures > n /*could be 3*/ && (failures / total attempts) > k% /*could be 30-40%*/

Basically, the first factor gives some headstart to the second; the second 
factor then takes the cluster size and the task size into account.
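A minimal sketch of the proposed check follows. The class, field, and threshold 
names are illustrative assumptions, not actual Hadoop JobTracker code:

public class FetchFailureHeuristic {

  private static final int MIN_FAILURES = 3;        // headstart threshold (n)
  private static final double FAILURE_RATIO = 0.35; // ratio threshold (k), e.g. 30-40%

  private int fetchFailures = 0;
  private int fetchAttempts = 0;

  public void recordAttempt(boolean failed) {
    fetchAttempts++;
    if (failed) {
      fetchFailures++;
    }
  }

  /** A map output is flagged only when both conditions hold. */
  public boolean tooManyFetchFailures() {
    return fetchFailures > MIN_FAILURES
        && fetchAttempts > 0
        && ((double) fetchFailures / fetchAttempts) > FAILURE_RATIO;
  }
}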

Additionally, we could take recency into account, e.g. only count failures and 
attempts in the last hour. We do not want to make the window too small.
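A hypothetical sketch of that recency variant, where only attempts seen within 
the last hour contribute to the ratio (again, names and thresholds are 
assumptions, not actual Hadoop code):

import java.util.ArrayDeque;
import java.util.Deque;

public class WindowedFetchFailures {

  private static final long WINDOW_MS = 60L * 60L * 1000L; // one hour
  private static final int MIN_FAILURES = 3;
  private static final double FAILURE_RATIO = 0.35;

  /** Each entry: [timestamp, 1 if the attempt failed, else 0]. */
  private final Deque<long[]> attempts = new ArrayDeque<long[]>();

  public void recordAttempt(boolean failed, long now) {
    attempts.addLast(new long[] { now, failed ? 1 : 0 });
    evictOld(now);
  }

  public boolean tooManyFetchFailures(long now) {
    evictOld(now);
    int total = attempts.size();
    int failures = 0;
    for (long[] a : attempts) {
      failures += (int) a[1];
    }
    return failures > MIN_FAILURES
        && total > 0
        && ((double) failures / total) > FAILURE_RATIO;
  }

  /** Drop attempts older than the window so stale history is ignored. */
  private void evictOld(long now) {
    while (!attempts.isEmpty() && now - attempts.peekFirst()[0] > WINDOW_MS) {
      attempts.removeFirst();
    }
  }
}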



