[
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069242#comment-13069242
]
Robert Joseph Evans commented on MAPREDUCE-2324:
------------------------------------------------
I have been looking at MR-279 and I want to do something similar there, but it
is not really set up to do it easily. The scheduling is split between the
resource manager and the application master, and in fact the resource manager
and application master completely ignore disk utilization at this point.
The plan is to add disk utilization to the resources that the RM tracks, and
then have the AM request both disk and RAM for reduces, with the disk space
based on the size estimate currently used. Then the scheduler, which in my
opinion is the right place to decide whether a request is being starved, would
do just what we do now, but generalized across all resource constraints, not
just disk. This means that all schedulers would have to be modified to support
this, but I can make the code generic, so it should be fairly simple to do. I
just need to dig into the MR-279 code to decide exactly how I want to insert
this. I should hopefully have a patch by the middle of next week.
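A minimal sketch of the direction I mean (every class, enum, and method name
below is hypothetical, not the actual MR-279 scheduler API): the request
carries memory and disk together, and the starvation check loops over every
resource dimension instead of special-casing disk.
{code:java}
import java.util.EnumMap;
import java.util.Map;

public class ResourceStarvationSketch {

  // Hypothetical resource dimensions the RM would track.
  enum ResourceType { MEMORY_MB, DISK_MB }

  // Hypothetical request the AM would send for a reduce: RAM plus the disk
  // size estimate currently used for reduce scheduling.
  static class ResourceRequest {
    final Map<ResourceType, Long> demand = new EnumMap<>(ResourceType.class);
    ResourceRequest(long memoryMb, long diskMb) {
      demand.put(ResourceType.MEMORY_MB, memoryMb);
      demand.put(ResourceType.DISK_MB, diskMb);
    }
  }

  // Hypothetical view of a node's free capacity as seen by the scheduler.
  static class NodeCapacity {
    final Map<ResourceType, Long> free = new EnumMap<>(ResourceType.class);
    NodeCapacity(long freeMemoryMb, long freeDiskMb) {
      free.put(ResourceType.MEMORY_MB, freeMemoryMb);
      free.put(ResourceType.DISK_MB, freeDiskMb);
    }
  }

  // Generic starvation check: the request is unsatisfiable if no node can
  // meet every resource dimension it asks for.
  static boolean isStarved(ResourceRequest request, Iterable<NodeCapacity> nodes) {
    for (NodeCapacity node : nodes) {
      boolean fits = true;
      for (Map.Entry<ResourceType, Long> e : request.demand.entrySet()) {
        if (node.free.getOrDefault(e.getKey(), 0L) < e.getValue()) {
          fits = false;
          break;
        }
      }
      if (fits) {
        return false; // at least one node can host the task
      }
    }
    return true; // no node satisfies all resource constraints
  }
}
{code}
The point of keeping the check generic is that a new resource type only adds
an enum value and a demand entry; the schedulers themselves do not need a
disk-specific code path.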
> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.20.2, 0.20.205.0
> Reporter: Todd Lipcon
> Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt,
> MR-2324-security-v3.patch
>
>
> If there's a reduce task that needs more disk space than is available on any
> mapred.local.dir in the cluster, that task will stay pending forever. For
> example, we produced this in a QA cluster by accidentally running terasort
> with one reducer - since no mapred.local.dir had 1T free, the job remained in
> pending state for several days. The reason for the "stuck" task wasn't clear
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs
> and finds that there isn't enough space.
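A hypothetical illustration of the check proposed above (the helper and
parameter names are made up for this sketch, not JobTracker internals): once
the reduce's disk estimate has been compared against every TaskTracker's free
mapred.local.dir space and none can fit it, fail the job instead of leaving it
pending.
{code:java}
import java.util.List;

public class ReduceSpaceCheck {

  // True if at least one tracker has enough free local-dir space for the
  // reduce's estimated output.
  static boolean canScheduleAnywhere(long reduceDiskEstimateBytes,
                                     List<Long> trackerFreeDiskBytes) {
    for (long free : trackerFreeDiskBytes) {
      if (free >= reduceDiskEstimateBytes) {
        return true;
      }
    }
    return false; // every tracker was checked and none has enough space
  }

  // If no tracker can host the reduce, fail the job so the user sees the
  // problem instead of a task stuck in pending.
  static void maybeFailJob(long reduceDiskEstimateBytes,
                           List<Long> trackerFreeDiskBytes,
                           Runnable failJob) {
    if (!canScheduleAnywhere(reduceDiskEstimateBytes, trackerFreeDiskBytes)) {
      failJob.run();
    }
  }
}
{code}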
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira