[
https://issues.apache.org/jira/browse/HADOOP-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549651
]
Amar Kamat commented on HADOOP-2247:
------------------------------------
_THE *WAIT-KILL* DILEMMA_
Following are the issues to be considered while deciding whether a map should
be killed or not. Earlier, the backoff function backed off by a random amount
between 1min-6min. After HADOOP-1984, the backoff function is exponential in
nature. The total amount of time spent by a reducer on fetching a map output
before giving up is {{max-backoff}}, so in all ({{3 * max-backoff}}) time is
required for a map task to be killed based on the reducers' reports. So the
first thing to do is to adjust the {{mapred.reduce.max.backoff}} parameter so
that the map is not killed early. Other parameters we are working on are as
follows:
* *Reducer-health* : There should be a way to decide how the reducer is
performing. One such parameter is ({{num-fail-fetches/num-fetches}}). Roughly,
this ratio being > 50% conveys that the reducer is not performing well enough.
* *Reducer-progress* : There should be a way to decide how the reducer is
progressing. One such parameter is ({{num-outputs-fetched/num-maps}}). Roughly,
this ratio being > 50% conveys that the reducer has made considerable progress.
* *Avg map completion time* : This time should determine when a fetch attempt
should be considered failed and hence reported to the JT.
* *Num-reducers* : The number of reducers in a particular job might provide
some insight into how contended the resources might be. A low number of
reducers combined with failing output fetches for a single map indicates that
the problem is map-side. If the reducer is not able to fetch any map output
then the problem is reducer-side. If there are many reducers and failures in
map fetches then there is a high chance of congestion. (A rough sketch
combining these signals follows this list.)
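To make the above concrete, here is a minimal sketch of how these signals could
be combined into a single wait/kill diagnosis. Everything here is illustrative:
the class name, the {{FEW_REDUCERS}} cut-off and the
{{failuresOnSingleMapOnly}} flag are assumptions, not existing code.
{code:java}
// Illustrative sketch only: class, method, and threshold names are assumptions,
// not existing Hadoop code. Thresholds mirror the rough 50% figures above.
public class ShuffleDiagnosis {

  private static final int FEW_REDUCERS = 2;          // assumed cut-off for a "low" reducer count
  private static final double RATIO_THRESHOLD = 0.5;  // the rough 50% figure used above

  /** Reducer-health: num-fail-fetches / num-fetches; > 50% means "not doing well". */
  static boolean reducerUnhealthy(int numFailFetches, int numFetches) {
    return numFetches > 0 && (double) numFailFetches / numFetches > RATIO_THRESHOLD;
  }

  /** Reducer-progress: num-outputs-fetched / num-maps; > 50% means "considerable progress". */
  static boolean reducerProgressing(int numOutputsFetched, int numMaps) {
    return numMaps > 0 && (double) numOutputsFetched / numMaps > RATIO_THRESHOLD;
  }

  /** Rough classification of where the shuffle problem most likely lies. */
  static String diagnose(int numFailFetches, int numFetches, int numOutputsFetched,
                         int numMaps, int numReducers, boolean failuresOnSingleMapOnly) {
    if (!reducerUnhealthy(numFailFetches, numFetches)) {
      return "healthy";            // failure ratio is low: nothing to do
    }
    if (numOutputsFetched == 0) {
      return "reducer-side";       // cannot fetch any map output: suspect the reducer
    }
    if (numReducers <= FEW_REDUCERS && failuresOnSingleMapOnly) {
      return "map-side";           // few reducers, one map keeps failing: suspect that map
    }
    if (numReducers > FEW_REDUCERS) {
      return "congestion";         // many reducers seeing failures: likely network load
    }
    // Unhealthy but ambiguous; considerable progress argues for waiting over killing.
    return reducerProgressing(numOutputsFetched, numMaps) ? "wait" : "unclear";
  }
}
{code}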
One thing to notice is that
* it requires ({{3 * max-backoff}}) amount of time to kill a map, and
* it requires 5 minutes (in the worst case) to kill a reducer when 5 fetches
fail simultaneously (a quick worked example follows).
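A quick back-of-the-envelope model of these timings, assuming
{{mapred.reduce.max.backoff}} is set to 5 minutes (the value and the method
names are illustrative, not taken from the actual shuffle code):
{code:java}
// Rough model of the worst-case timings described above; not the actual
// shuffle/backoff code from HADOOP-1984.
public class KillTimings {

  /** A reducer spends at most max-backoff (seconds) on one map output before giving up. */
  static long secondsToReportFetchFailure(long maxBackoffSec) {
    return maxBackoffSec;
  }

  /** The JT needs 3 independent reports, so killing a map can take 3 * max-backoff. */
  static long worstCaseSecondsToKillMap(long maxBackoffSec) {
    return 3 * secondsToReportFetchFailure(maxBackoffSec);
  }

  public static void main(String[] args) {
    long maxBackoffSec = 5 * 60;   // assume mapred.reduce.max.backoff of 5 minutes
    System.out.println("worst case to kill a map: "
        + worstCaseSecondsToKillMap(maxBackoffSec) / 60 + " min");
    // Prints 15 min, versus roughly 5 min to kill a reducer when 5 fetches
    // fail simultaneously, which is the asymmetry noted above.
  }
}
{code}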
A better strategy would be to:
* use *avg-map-completion-time* as a parameter in deciding the time to report
failure. {{max-backoff}} should also depend on the avg map completion time.
* use *num-reducers* as a parameter in deciding how much to back off and
whether the map should be killed or the reducer should back off (wait).
* use *(num-maps - num-finished)* and *(num-fetch-fail / num-fetched)* as
parameters in deciding when to kill the reducer. A good strategy would be to
kill a reducer if it fails to fetch the outputs of 50% of the maps and not many
map outputs have been fetched. It could be the case that the reducer has
fetched the map outputs but with some failures; in that case the fetch-fail
ratio will be high but the progress will also be considerable. We don't want to
penalize a reducer which has fetched many map outputs with a lot of failures.
* *ratio-based-map-killing* : the JT should also kill a map based on some
percentage of reducers along with the hard-coded number 3. For example, kill a
map if 50% of the reducers report failures and num-reports >= 3. It might also
help the JT to have a global idea of which map outputs are being fetched so
that the scheduling of new tasks and the killing of maps can be decided.
* *fetch-success event notification* : the JT should be informed by a reducer
about a successful map-output-fetch event, as a result of which the counters
regarding the killing of that map should be reset. In a highly congested
system, finding 3 reducers that fail on the first attempt for a particular map
is easy. (A rough sketch of this JT-side bookkeeping follows the list.)
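Here is a rough sketch of the JT-side bookkeeping the last two points would
need. The class name, the per-reducer set and the reset-on-success behaviour
are assumptions for illustration; the minimum of 3 reports and the 50% ratio
are the numbers proposed above.
{code:java}
// Illustrative only: per-map bookkeeping the JT could keep to combine the
// hard-coded minimum of 3 reports with a ratio-based check and a reset on
// fetch-success notifications.
import java.util.HashSet;
import java.util.Set;

public class MapKillDecision {

  private static final int MIN_FAILURE_REPORTS = 3;     // the existing hard-coded number
  private static final double REDUCER_RATIO = 0.5;      // proposed 50% of reducers

  private final int totalReducers;
  private final Set<String> reducersReportingFailure = new HashSet<String>();

  public MapKillDecision(int totalReducers) {
    this.totalReducers = totalReducers;
  }

  /** A reducer reports that it failed to fetch this map's output. */
  public synchronized void reportFetchFailure(String reducerId) {
    reducersReportingFailure.add(reducerId);
  }

  /** A reducer reports a successful fetch: reset the kill counters for this map. */
  public synchronized void reportFetchSuccess() {
    reducersReportingFailure.clear();
  }

  /** Kill only if at least 3 distinct reducers AND at least 50% of them report failures. */
  public synchronized boolean shouldKillMap() {
    int reports = reducersReportingFailure.size();
    return reports >= MIN_FAILURE_REPORTS
        && reports >= Math.ceil(REDUCER_RATIO * totalReducers);
  }
}
{code}
Tracking distinct reducer ids (rather than a raw counter) is what makes the 50%
ratio meaningful, and clearing the set on a fetch-success notification keeps a
temporarily congested cluster from accumulating stale reports.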
----
Comments?
> Mappers fail easily due to repeated failures
> --------------------------------------------
>
> Key: HADOOP-2247
> URL: https://issues.apache.org/jira/browse/HADOOP-2247
> Project: Hadoop
> Issue Type: Bug
> Affects Versions: 0.15.0
> Environment: 1400 Node hadoop cluster
> Reporter: Srikanth Kakani
> Priority: Blocker
> Fix For: 0.15.2
>
>
> Related to HADOOP-2220, problem introduced in HADOOP-1158
> At this scale, hardcoding the number of fetch failures to a static number (in
> this case 3) is never going to work. Since the jobs we are running are loading
> the systems, 3 failures can randomly occur within the lifetime of a map. Even
> fetching the data can cause enough load for that many failures to occur.
> We believe that the number of tasks and the size of the cluster should be
> taken into account. Based on that, we believe that the ratio between total
> fetch attempts and total failed attempts should be taken into consideration.
> Given our experience, a task should be declared to have "Too many fetch
> failures" based on:
> failures > n /*could be 3*/ && (failures/total attempts) > k% /*could be
> 30-40%*/
> Basically, the first factor gives some head start to the second factor; the
> second factor then takes into account the cluster size and the task size.
> Additionally, we could take recency into account, say failures and attempts in
> the last hour. We do not want to make the window too small.