[ https://issues.apache.org/jira/browse/HADOOP-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551491 ]

Amar Kamat commented on HADOOP-2247:
------------------------------------

I am submitting a common patch for HADOOP-2220 and HADOOP-2247, since the 
combined effect of the map-kill and reducer-kill strategies is what is 
desired. Following are the changes this patch proposes (a short sketch of the 
resulting checks appears after the list):
- _Map Killing_ : Following are the conditions that will now determine whether 
a map is killed
  1. {{num-fetch-fail-notifications >= 3}}
  2. {{(num-fetch-fail-notifications/num-reducers) >= max-allowed}}, where 
{{max-allowed = 0.5}}
- _Reducer Killing_ : Following are the conditions that will now determine 
whether a reducer is killed
  1. {{num-unique-failures >= 5}}
  2. {{(num-failed-attempts/num-attempts) >= max-allowed}}, where 
{{max-allowed = 0.5}}
  3. {{(num-copied/num-maps) <= min-required}}, where {{min-required = 0.5}}, _OR_ 
{{time-without-progress >= (min-shuffle-exec-time/2)}}
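
For illustration, here is a minimal sketch of how the above conditions could be 
combined, assuming the conditions in each list are ANDed together (with the 
_OR_ inside reducer condition 3, as discussed below); the class, method and 
constant names are made up for this example and are not the identifiers used 
in the patch:

{code:java}
/** Illustrative sketch only -- not the actual patch code. */
class KillPolicySketch {
  // Assumed thresholds, taken from the values listed above.
  static final int MIN_FETCH_FAIL_NOTIFICATIONS = 3;
  static final double MAX_ALLOWED = 0.5;   // max allowed failure fraction
  static final int MIN_UNIQUE_FAILURES = 5;
  static final double MIN_REQUIRED = 0.5;  // min required copied fraction

  /** Map killing: enough notifications and a majority of reducers affected. */
  static boolean shouldKillMap(int fetchFailNotifications, int numReducers) {
    return fetchFailNotifications >= MIN_FETCH_FAIL_NOTIFICATIONS
        && ((double) fetchFailNotifications / numReducers) >= MAX_ALLOWED;
  }

  /** Reducer killing: many unique failures, a high failure rate, and either
      little progress made or a stalled shuffle. */
  static boolean shouldKillReducer(int uniqueFailures,
                                   int failedAttempts, int totalAttempts,
                                   int copiedMapOutputs, int numMaps,
                                   long timeWithoutProgress,
                                   long minShuffleExecTime) {
    boolean manyUniqueFailures = uniqueFailures >= MIN_UNIQUE_FAILURES;
    boolean highFailureRate =
        ((double) failedAttempts / totalAttempts) >= MAX_ALLOWED;
    boolean littleProgress =
        ((double) copiedMapOutputs / numMaps) <= MIN_REQUIRED;
    boolean stalled = timeWithoutProgress >= minShuffleExecTime / 2;
    return manyUniqueFailures && highFailureRate && (littleProgress || stalled);
  }
}
{code}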


Here are the details of, and the insight behind, this design
- In the map case, a vote is taken before killing the map. If more than 
50% of the reducers fail to fetch the map output, then the map should be 
re-executed. Even a single reducer that keeps reporting fetch failures for a 
map will eventually push the count to {{num-reducers/2}}, which by then 
implies that the map host ran into a problem and has had sufficient time to 
recover from it. This makes sure that the map is not killed too early, while 
still guaranteeing that it gets killed/re-executed sooner or later.
   *_CASE_* : Consider a case where the first 2 attempts by 2 reducers result 
in fetch failures and the subsequent attempts succeed. This alone could cause 
the map to be re-executed as soon as a 3rd reducer fails for the first time. 
The voting condition (2) above overcomes this flaw. 
- In the reducer case, the _num failed attempts_, _progress made_ and _stalled 
time_ are also taken into consideration. The reasons for doing this are 
  1. _num failed attempts_ : It helps in cases where the reducer fails on 
unique maps but only a few times overall, and thus gives the reducer some more 
time. 
  2. _progress made_ : It helps avoid killing the reducer if it has already 
progressed a lot and killing it would be a big overhead. 
        *_CASE_* : Consider a case where the reducer has failed once on every 
attempt before succeeding. In this case the _failure rate_ is 50% and the 
_unique failures_ count is also more than 3, but the progress made is more 
than 50%. So _progress made_ balances _num failed attempts_ and _unique 
failures_ in some cases.  
  3. _stalled time_ : It helps in cases where the reducer has made a lot of 
progress but then encounters a problem in the final steps. Since the _progress 
made_ is more than 50% in that situation, there still has to be a way to kill 
the reducer. Stalled time is calculated from the {{max-map-completion-time}} 
and the {{duration of the shuffle phase before stalling}}. So the reducer will 
have {{min-shuffle-exec-time = max(max-map-completion-time, 
duration-before-stall)}}, and the reducer is considered stalled if it shows no 
progress for {{min-shuffle-exec-time/2}} amount of time.
  In the above conditions, {{num-unique-failures}} gives the head start while 
the others help maintain the balance through the rest of the shuffle phase.
- {{max-backoff}} is now set to {{max(default-max-backoff, 
map-completion-time)}}. This allows a more granular approach to map killing: 
the larger the map, the more time is required before killing it, while faster 
maps get killed sooner. This parameter affects both map killing and reducer 
killing (hence a common patch); see the timing sketch below.
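
As a rough illustration of the timing side, the following sketch shows how the 
{{min-shuffle-exec-time}}, the stall check and the adaptive {{max-backoff}} 
described above could be derived; the class and method names are assumptions, 
not identifiers from the patch:

{code:java}
/** Illustrative sketch only -- not the actual patch code. */
class ShuffleTimingSketch {
  /** The reducer's shuffle budget is at least the slowest map's completion
      time (or the shuffle time already spent before stalling). */
  static long minShuffleExecTime(long maxMapCompletionTime,
                                 long shuffleDurationBeforeStall) {
    return Math.max(maxMapCompletionTime, shuffleDurationBeforeStall);
  }

  /** The reducer is considered stalled after half of that budget passes with
      no progress. */
  static boolean isStalled(long timeWithoutProgress, long minShuffleExecTime) {
    return timeWithoutProgress >= minShuffleExecTime / 2;
  }

  /** Larger maps get a larger back-off window, so slower maps take longer to
      be given up on while faster maps are re-executed sooner. */
  static long maxBackoff(long defaultMaxBackoff, long mapCompletionTime) {
    return Math.max(defaultMaxBackoff, mapCompletionTime);
  }
}
{code}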
 ----
Srikanth and Christian, could you please try this out and comment? Any 
comments on the strategy and the default percentages?

> Mappers fail easily due to repeated failures
> --------------------------------------------
>
>                 Key: HADOOP-2247
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2247
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>         Environment: 1400 Node hadoop cluster
>            Reporter: Srikanth Kakani
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: HADOOP-2220.patch
>
>
> Related to HADOOP-2220, problem introduced in HADOOP-1158
> At this scale, hardcoding the number of fetch failures to a static number (in 
> this case 3) is never going to work. Although the jobs we are running are 
> loading the systems, 3 failures can randomly occur within the lifetime of a 
> map. Even fetching the data can cause enough load for that many failures to 
> occur.
> We believe that the number of tasks and the size of the cluster should be 
> taken into account. Based on this, we believe that the ratio between total 
> fetch attempts and total failed attempts should be taken into consideration.
> Given our experience, a task should be declared to have "Too many fetch 
> failures" based on:
> failures > n /*could be 3*/ && (failures/total attempts) > k% /*could be 
> 30-40%*/
> Basically the first factor gives some headstart to the second factor; the 
> second factor then takes the cluster size and the task size into account.
> Additionally we could take recency into account, say failures and attempts in 
> the last one hour. We do not want to make the window too small.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
