[ https://issues.apache.org/jira/browse/HADOOP-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551491 ]
Amar Kamat commented on HADOOP-2247:
------------------------------------

I am submitting a common patch for HADOOP-2220 and HADOOP-2247, since the combined effect of the map-kill and reducer-kill strategies is what is desired. Following are the changes that this patch proposes:

- _Map Killing_ : Following are the conditions that will now determine the killing of a map
1. {{num-fetch-fail-notifications >= 3}}
2. {{(num-fetch-fail-notifications / num-reducers) >= max-allowed}}, here {{max-allowed = 0.5}}
- _Reducer Killing_ : Following are the conditions that will now determine the killing of a reducer
1. {{num-unique-failures >= 5}}
2. {{num-failed-attempts / num-attempts >= max-allowed}}, {{max-allowed = 0.5}}
3. {{num-copied / num-maps <= min-required}}, {{min-required = 0.5}} _OR_ {{time-without-progress >= (min-shuffle-exec-time/2)}}

Here are the details and insights behind this design (a rough sketch of the combined checks is given below):

- In the map case, a vote is taken before killing the map. If more than 50% of the reducers fail to fetch the map output, then the map should be re-executed. A reducer continuously reporting failures for a map, pushing the count to >= num-reducers/2, also indicates that the map host recently ran into a problem and has had sufficient time to come out of it. This makes sure that the map is not killed too early, and also that it does get killed/re-executed sooner or later.
*_CASE_* : Consider a case where the first 2 attempts by 2 reducers result in fetch failures and subsequent attempts succeed. This could cause the map to be re-executed as soon as a 3rd reducer fails for the first time. The added ratio check overcomes this flaw.
- In the reducer case, _num failed attempts_, _progress made_ and _stalled time_ are also taken into consideration. The reasons for doing this are:
1. _num failed attempts_ : It helps in cases where the reducer fails on unique maps but only a few times, and thus gives the reducer some more time.
2. _progress made_ : It helps to avoid killing the reducer if it has progressed a lot and killing it would be a big overhead.
*_CASE_* : Consider a case where every fetch by the reducer fails once before succeeding. In this case the _failure rate_ is 50% and the _unique failures_ count is also more than 3, but the progress made is more than 50%. So _progress made_ balances _num failed attempts_ and _unique failures_ in some cases.
3. _stalled time_ : It helps in cases where the reducer has made a lot of progress but encounters a problem in the final steps. Since the _progress made_ is more than 50%, there still needs to be a way to kill the reducer. Stalled time is calculated from {{max-map-completion-time}} and the {{duration of the shuffle phase before stalling}}: the reducer's {{min-shuffle-exec-time}} is {{max(max-map-completion-time, duration-before-stall)}}, and the reducer is considered stalled if it shows no progress for {{min-shuffle-exec-time/2}} amount of time.
In the above scheme, {{uniq-fetch-failure}} gives the head start while the other checks maintain the balance over the rest of the shuffle phase.
- {{max-backoff}} is now set to {{max(default-max-backoff, map-completion-time)}}. This allows a granular approach to map killing: the larger the map, the more time required to kill it, while faster maps will be killed faster. This parameter affects both map killing and reducer killing (hence the common patch).

----

Srikanth and Christian, could you please try this out and comment? Any comments on the strategy and the default percentages?
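For readability, here is a minimal Java sketch of the checks described above. The names ({{FetchFailureKillPolicy}}, {{shouldReexecuteMap}}, {{shouldKillReducer}}, {{maxBackoffMs}}) are illustrative only and are not taken from the actual patch, and it assumes the per-task conditions are combined with AND (with the OR inside the third reducer condition), which is how I read the description:

{code:java}
// Illustrative sketch only: names and structure are assumptions, not the actual patch.
// It simply encodes the thresholds described in the comment above.
public class FetchFailureKillPolicy {

  private static final int MIN_FETCH_FAIL_NOTIFICATIONS = 3;      // map condition 1
  private static final int MIN_UNIQUE_FETCH_FAILURES = 5;         // reducer condition 1
  private static final double MAX_ALLOWED_FAILURE_FRACTION = 0.5; // "max-allowed"
  private static final double MIN_REQUIRED_COPY_FRACTION = 0.5;   // "min-required"

  /**
   * Map side: re-execute the map only after an absolute minimum number of
   * fetch-failure notifications AND a "vote" by at least half of the reducers.
   */
  public boolean shouldReexecuteMap(int fetchFailNotifications, int numReducers) {
    return fetchFailNotifications >= MIN_FETCH_FAIL_NOTIFICATIONS
        && ((double) fetchFailNotifications / numReducers) >= MAX_ALLOWED_FAILURE_FRACTION;
  }

  /**
   * Reducer side: kill the reducer only if it has failed on enough unique maps,
   * its overall failure rate is high, AND it has either copied too little or has
   * stalled for half of the minimum expected shuffle time.
   */
  public boolean shouldKillReducer(int uniqueFetchFailures,
                                   int failedAttempts, int totalAttempts,
                                   int mapOutputsCopied, int numMaps,
                                   long timeWithoutProgressMs,
                                   long maxMapCompletionTimeMs,
                                   long shuffleDurationBeforeStallMs) {
    long minShuffleExecTimeMs =
        Math.max(maxMapCompletionTimeMs, shuffleDurationBeforeStallMs);
    boolean tooManyUniqueFailures = uniqueFetchFailures >= MIN_UNIQUE_FETCH_FAILURES;
    boolean highFailureRate =
        ((double) failedAttempts / totalAttempts) >= MAX_ALLOWED_FAILURE_FRACTION;
    boolean littleProgress =
        ((double) mapOutputsCopied / numMaps) <= MIN_REQUIRED_COPY_FRACTION;
    boolean stalled = timeWithoutProgressMs >= minShuffleExecTimeMs / 2;
    return tooManyUniqueFailures && highFailureRate && (littleProgress || stalled);
  }

  /** max-backoff scales with map size so that bigger maps get more time. */
  public long maxBackoffMs(long defaultMaxBackoffMs, long mapCompletionTimeMs) {
    return Math.max(defaultMaxBackoffMs, mapCompletionTimeMs);
  }
}
{code}

The thresholds are kept as constants here only so the defaults (3, 5, and the 0.5 fractions) are visible in one place; in the real patch they would presumably come from the job configuration.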
> Mappers fail easily due to repeated failures
> --------------------------------------------
>
>                 Key: HADOOP-2247
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2247
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>         Environment: 1400 Node hadoop cluster
>            Reporter: Srikanth Kakani
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: HADOOP-2220.patch
>
>
> Related to HADOOP-2220; problem introduced in HADOOP-1158.
> At this scale, hardcoding the number of fetch failures to a static number (in this case 3) is never going to work. Although the jobs we are running are loading the systems, 3 failures can randomly occur within the lifetime of a map. Even fetching the data can cause enough load for so many failures to occur.
> We believe that the number of tasks and the size of the cluster should be taken into account. Based on this, we believe that the ratio between total fetch attempts and total failed attempts should be taken into consideration.
> Given our experience, a task should be declared "Too many fetch failures" based on:
> failures > n /*could be 3*/ && (failures/total attempts) > k% /*could be 30-40%*/
> Basically the first factor is to give some headstart to the second factor; the second factor then takes into account the cluster size and the task size.
> Additionally we could take recency into account, say failures and attempts in the last one hour. We do not want to make it too small.