[jira] [Moved] (MAPREDUCE-5617) map task is not re-launched when the task is failed while reducers are running with full cluster capacity - which will lead to job hang

Devaraj K (JIRA) Sun, 10 Nov 2013 22:50:23 -0800

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Devaraj K moved YARN-1396 to MAPREDUCE-5617:
--------------------------------------------

          Component/s:     (was: resourcemanager)
    Affects Version/s:     (was: 2.2.0)
                       2.2.0
                  Key: MAPREDUCE-5617  (was: YARN-1396)
              Project: Hadoop Map/Reduce  (was: Hadoop YARN)

> map task is not re-launched when the task is failed while reducers are 
> running with full cluster capacity - which will lead to job hang
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5617
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5617
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: SuSe Linux
>            Reporter: Sunil G
>            Priority: Critical
>
> In a cluster with 16 GB capacity, a job was started with 100 maps and 10 
> reducers. 
> After the reducers had started executing, one NM went down, which caused 
> 2 maps to fail. At that point the remaining 8 GB was fully used by 6 
> reducers and the AM, so there was no room to launch the failed maps. [The NM 
> never came up again, and the cluster size became 8 GB.]
> Even if we kill one of the reducers, the map still cannot be launched, 
> because the priority of a failed map is lower than that of a reducer. So only 
> the remaining reducers get allocated from the RM side.
> This causes the job to hang on the reducer side.
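
The scenario described above can be illustrated with a toy model. This is only a sketch of the reported behaviour, not Hadoop's actual allocator: the function names, the one-container-per-task sizing, and the rule that a killed reducer is re-queued are all assumptions made for illustration. The task counts (10 reducers, 6 running, 2 failed maps) come from the report.

```python
def allocate_freed_container(pending, reducers_outrank_failed_maps):
    """Pick which pending request receives a freed container (toy rule, hypothetical)."""
    if not pending:
        return None
    preferred = "reduce" if reducers_outrank_failed_maps else "failed_map"
    for i, kind in enumerate(pending):
        if kind == preferred:
            return pending.pop(i)
    # No request of the preferred kind is waiting: serve the oldest request.
    return pending.pop(0)


def simulate(reducers_outrank_failed_maps, kills=3):
    """Return how many failed maps are running after `kills` reducers are killed."""
    # From the report: 10 reducers in total, 6 running, 2 maps failed with the NM.
    pending = ["failed_map"] * 2 + ["reduce"] * 4
    running = ["reduce"] * 6  # the cluster is full: no free container

    for _ in range(kills):
        running.remove("reduce")   # killing a reducer frees one container
        pending.append("reduce")   # assumption: the killed reducer is re-queued
        granted = allocate_freed_container(pending, reducers_outrank_failed_maps)
        running.append(granted)

    return running.count("failed_map")


if __name__ == "__main__":
    print("reducers served first:", simulate(True))
    print("failed maps served first:", simulate(False))
```

In this model, `simulate(True)` returns 0: every freed container goes back to a reducer, so the failed maps never run and the reducers wait indefinitely for their output. `simulate(False)` returns 2, meaning both failed maps are relaunched once failed maps are served first.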



--
This message was sent by Atlassian JIRA
(v6.1#6144)
