[jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

Karthik Kambatla (JIRA) Tue, 14 Apr 2015 12:40:24 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494728#comment-14494728
 ]


Karthik Kambatla commented on MAPREDUCE-6302:
---------------------------------------------

Filed YARN-3485 to fix the FairScheduler issue. In addition to that fix, I 
wonder if we should improve the MapReduce side behavior as well. 

MAPREDUCE-5844 adds the notion of "hanging" requests to kickstart preemption, 
but that appears to kick in only when the headroom doesn't show enough 
resources to run containers. How about generalizing this to preempt containers 
in cases where there *appears* to be headroom, but the scheduler is unable to 
hand those resources to the app for some reason? In other words, I am 
proposing that MR use the headroom from YARN as a heuristic rather than an 
absolute guarantee. MR should make the best possible use of the resources 
actually given to it.
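
To make the proposal concrete, here is a rough sketch of the two decision 
rules side by side. This is not the actual MR AM allocator code: the class 
name, method names, and parameters are all hypothetical, and it models only 
cores. The first method reflects the behavior as I described it above 
(preempt only when headroom looks insufficient); the second treats headroom 
as a hint and preempts whenever a map request has been starved long enough.

```java
// Hypothetical illustration only; names and parameters are assumptions,
// not taken from the Hadoop code base.
final class HeadroomHeuristic {
    private HeadroomHeuristic() {}

    // True when at least one map request has waited past the timeout.
    static boolean isHanging(int pendingMaps, long oldestMapWaitMs,
                             long hangingTimeoutMs) {
        return pendingMaps > 0 && oldestMapWaitMs >= hangingTimeoutMs;
    }

    // Behavior as described in the comment: a hanging request triggers
    // reducer preemption only if the reported headroom cannot fit a map.
    static boolean preemptWhenHeadroomShort(int pendingMaps, int headroomCores,
                                            int coresPerMap, long oldestMapWaitMs,
                                            long hangingTimeoutMs) {
        return isHanging(pendingMaps, oldestMapWaitMs, hangingTimeoutMs)
                && headroomCores < coresPerMap;
    }

    // Proposed generalization: headroom is only a heuristic. If a map
    // request is still hanging, preempt reducers even when the scheduler
    // reports headroom that it has not actually handed to the app.
    static boolean preemptWhenStarved(int pendingMaps, int headroomCores,
                                      int coresPerMap, long oldestMapWaitMs,
                                      long hangingTimeoutMs) {
        // headroomCores and coresPerMap are deliberately ignored here.
        return isHanging(pendingMaps, oldestMapWaitMs, hangingTimeoutMs);
    }
}
```

With one pending map, a reported headroom of 10 cores, and a request that has 
waited twice the timeout, the first rule does not preempt while the second 
does. That is the case where the scheduler reports headroom it never hands 
out, which is what this JIRA appears to be hitting.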

> deadlock in a job between map and reduce cores allocation 
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-6302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: AM_log_head100000.txt.gz, AM_log_tail100000.txt.gz, 
> queue_with_max163cores.png, queue_with_max263cores.png, 
> queue_with_max333cores.png
>
>
> I submitted a big job, with 500 maps and 350 reduces, to a FairScheduler 
> queue with a maximum of 300 cores. When the job's maps reach 100%, 300 
> reduces occupy all 300 cores allowed in the queue. Then a map fails and is 
> retried, waiting for a core, while the 300 reduces wait for the failed map 
> to finish, so a deadlock occurs. As a result, the job is blocked, and later 
> jobs in the queue cannot run because no cores are available in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
