[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255649#comment-13255649
 ] 

Jason Lowe commented on MAPREDUCE-4144:
---------------------------------------

Is the concern that with this change we won't remove the reservation or the 
NODE_LOCAL request?  That could already happen when the node doesn't free up 
sufficient resources before the application finishes with containers on other 
nodes.  Assuming the app doesn't complete first, I think the reservation will 
be cleaned up in assignReservedContainer(), either because there are no more 
outstanding requests at the same priority or because it fills the reservation 
with an ANY request (since we know there aren't any more RACK_LOCAL requests 
in this scenario).

But I might be misreading the code.  If it's critical to allocate the reserved 
container as NODE_LOCAL once the node has enough free resources, we can undo 
this fix and instead put the rackLocal null check in 
AppSchedulingInfo.allocateNodeLocal().
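
For illustration, here is a rough sketch of what such a null check could look 
like.  This is not the actual AppSchedulingInfo code: the class, its methods, 
and the map-of-counts layout (with "*" standing in for ANY) are all 
hypothetical simplifications, meant only to show a node-local allocation 
tolerating a missing rack-local entry instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the per-priority request table.
// Keys are resource names (a host, a rack, or "*" for ANY); values are the
// number of containers still outstanding for that name.
class RequestTableSketch {
    static final String ANY = "*"; // assumed key for the ANY request

    private final Map<String, Integer> outstanding = new HashMap<>();

    void request(String resourceName, int containers) {
        outstanding.put(resourceName, containers);
    }

    Integer outstandingFor(String resourceName) {
        return outstanding.get(resourceName);
    }

    // Decrement the count for one resource name, dropping the entry at zero.
    // A missing entry is skipped rather than dereferenced: the null check.
    private void decrementIfPresent(String resourceName) {
        Integer count = outstanding.get(resourceName);
        if (count == null) {
            return; // e.g. the rack-local request was already consumed
        }
        if (count <= 1) {
            outstanding.remove(resourceName);
        } else {
            outstanding.put(resourceName, count - 1);
        }
    }

    // Record a node-local allocation: one fewer container is needed at the
    // host, rack, and ANY levels, whichever of them are still present.
    void allocateNodeLocal(String host, String rack) {
        decrementIfPresent(host);
        decrementIfPresent(rack);
        decrementIfPresent(ANY);
    }

    public static void main(String[] args) {
        RequestTableSketch table = new RequestTableSketch();
        table.request("host1", 1);
        table.request(ANY, 2);
        // No rack entry exists, mirroring the scenario in this issue.
        table.allocateNodeLocal("host1", "/rack1");
        System.out.println(table.outstandingFor("host1")); // prints "null"
        System.out.println(table.outstandingFor(ANY));     // prints "1"
    }
}
```

With the check in place the reserved container is still accounted as 
node-local, and the absent rack-local entry is simply left alone.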
                
> ResourceManager NPE while handling NODE_UPDATE
> ----------------------------------------------
>
>                 Key: MAPREDUCE-4144
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4144
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-4144-testcase.patch, MAPREDUCE-4144.patch
>
>
> The RM on one of our clusters has exited twice in the past few days because 
> of an NPE while trying to handle a NODE_UPDATE:
> {noformat}
> 2012-04-12 02:09:01,672 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
>  [ResourceManager Event Processor]java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:261)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:223)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.allocate(SchedulerApp.java:246)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1229)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignNodeLocalContainers(LeafQueue.java:1078)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:859)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:756)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:573)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:622)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:78)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:302)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> This is very similar to the failure reported in MAPREDUCE-3005.
