[ https://issues.apache.org/jira/browse/MAPREDUCE-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4144:
----------------------------------

    Attachment: MAPREDUCE-4144-testcase.patch

I think the fix for MAPREDUCE-3005 is being skipped due to reserved containers. Here's the scenario:

# Application has some requests that are NODE_LOCAL and some others that are ANY.
# Node A heartbeats in and we try to schedule the NODE_LOCAL request on it, but there are no available containers and instead we make a reservation.
# Node B heartbeats in and it's on the same rack as Node A, so we fulfill the corresponding RACK_LOCAL request that went with Node A's NODE_LOCAL request.
# Node A heartbeats in with some spare containers, and we skip the MAPREDUCE-3005 fix in canAssign() because there is a reserved container on this node. Since the RACK_LOCAL request was removed when we assigned it to Node B, we crash because we assume all NODE_LOCAL requests will have a corresponding RACK_LOCAL request.

I checked the RM log above the crash, and I did find indications of container reservations being in play. For example:

{noformat}
[ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1334157153376_0281 on node: xxx:8041
[ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp: Application application_1334157153376_0281 unreserved on node host: xxx:8041 #containers=3 available=7680 used=13824, currently has 70 at priority org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@33; currentReservation memory: 322560
{noformat}

Attached is a testcase that reproduces the NPE crash with the same backtrace.
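The four-step scenario above can be replayed in a small standalone sketch. This is not the actual AppSchedulingInfo code or the attached testcase: the class `ReservationNpeSketch`, the map-based request table, and the helper names are all hypothetical, chosen only to show why a node-local allocation that assumes a companion RACK_LOCAL entry fails once that entry has been consumed by another node. The null-safe variant is included purely for contrast and is not a proposed patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheduler's per-application request table.
// This is NOT the real AppSchedulingInfo; names and structure are illustrative.
class ReservationNpeSketch {
    static final String ANY = "*";

    // Decrement a location's outstanding count, dropping the entry at zero.
    // If the entry is missing, unboxing null throws NullPointerException.
    static void decrement(Map<String, Integer> requests, String location) {
        int remaining = requests.get(location) - 1;
        if (remaining == 0) {
            requests.remove(location);
        } else {
            requests.put(location, remaining);
        }
    }

    // Null-safe variant: a missing entry is treated as already satisfied.
    static void decrementIfPresent(Map<String, Integer> requests, String location) {
        requests.computeIfPresent(location, (k, v) -> v > 1 ? v - 1 : null);
    }

    // Replays the scenario; returns true if the final node-local allocation hit an NPE.
    static boolean replay(boolean guardRack) {
        Map<String, Integer> requests = new HashMap<>();
        requests.put("nodeA", 1);  // NODE_LOCAL request (step 1)
        requests.put("rack1", 1);  // its companion RACK_LOCAL request
        requests.put(ANY, 2);      // ANY covers the node-local request plus one more

        // Step 2: node A has no free containers, so only a reservation is made.
        // Step 3: node B (same rack) fulfills the RACK_LOCAL request, removing it.
        decrement(requests, "rack1");
        decrement(requests, ANY);

        // Step 4: node A has spare containers and the locality check is skipped,
        // so the node-local allocation runs against a table with no rack entry.
        try {
            decrement(requests, "nodeA");
            if (guardRack) {
                decrementIfPresent(requests, "rack1");
            } else {
                decrement(requests, "rack1");
            }
            decrement(requests, ANY);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded crashes: " + replay(false));
        System.out.println("guarded crashes: " + replay(true));
    }
}
```

In this model the unguarded path fails at the rack-level decrement, which mirrors the assumption described in step 4; whether the real fix belongs in the reservation path or in the allocation bookkeeping is left open here.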
> ResourceManager NPE while handling NODE_UPDATE
> ----------------------------------------------
>
>                 Key: MAPREDUCE-4144
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4144
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4144-testcase.patch
>
>
> The RM on one of our clusters has exited twice in the past few days because of an NPE while trying to handle a NODE_UPDATE:
> {noformat}
> 2012-04-12 02:09:01,672 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> [ResourceManager Event Processor]java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:261)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:223)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.allocate(SchedulerApp.java:246)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1229)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignNodeLocalContainers(LeafQueue.java:1078)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:859)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:756)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:573)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:622)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:78)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:302)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> This is very similar to the failure reported in MAPREDUCE-3005.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira