NPE in AM causes it to lose containers which are never returned back to RM
--------------------------------------------------------------------------

Key: MAPREDUCE-2693
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Reporter: Amol Kekre
Priority: Critical
Fix For: 0.23.0

The following exception in AM of an application at the top of queue causes this. Once this happens, AM keeps obtaining containers from RM and simply loses them. Eventually on a cluster with multiple jobs, no more scheduling happens because of these lost containers.

It happens when there are blacklisted nodes at the app level in AM. A bug in AM (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the request-table. We should make sure RM also knows about this update.

========================================================================
11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
        at java.lang.Thread.run(Thread.java:619)
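
The failure mode described in this report can be illustrated with a minimal, self-contained sketch. It is not the actual RMContainerRequestor code: the class name RequestTableSketch, the flat host-to-count map, and the methods addRequest, decUnguarded, and decGuarded are all hypothetical, and the real request table is likely structured differently. The sketch assumes only what the report states: that containerFailedOnHost(hostName) removes the host from a local request table, and that a later decrement for a container assigned on that host hits a missing entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the AM's request table (not Hadoop code). */
public class RequestTableSketch {
    // resourceName (for example a host name) -> outstanding container count
    private final Map<String, Integer> remoteRequests = new HashMap<>();

    public void addRequest(String resourceName) {
        remoteRequests.merge(resourceName, 1, Integer::sum);
    }

    // As described in the report: the blacklisted host is simply dropped
    // from the local table, and nothing is told to the RM.
    public void containerFailedOnHost(String hostName) {
        remoteRequests.remove(hostName);
    }

    // Unguarded decrement: get() returns null for a removed host, and
    // unboxing null into an int throws NullPointerException.
    public void decUnguarded(String resourceName) {
        int outstanding = remoteRequests.get(resourceName);
        remoteRequests.put(resourceName, outstanding - 1);
    }

    // Guarded decrement (one possible defensive variant, an assumption):
    // returns false when there is no entry so the caller can decide what
    // to do with the container instead of crashing the heartbeat.
    public boolean decGuarded(String resourceName) {
        Integer outstanding = remoteRequests.get(resourceName);
        if (outstanding == null) {
            return false;
        }
        if (outstanding <= 1) {
            remoteRequests.remove(resourceName);
        } else {
            remoteRequests.put(resourceName, outstanding - 1);
        }
        return true;
    }

    public static void main(String[] args) {
        RequestTableSketch table = new RequestTableSketch();
        table.addRequest("host1");
        table.containerFailedOnHost("host1"); // host1 is blacklisted

        try {
            table.decUnguarded("host1"); // RM still assigned a container on host1
        } catch (NullPointerException e) {
            System.out.println("unguarded decrement threw NPE");
        }
        System.out.println("guarded decrement handled: " + table.decGuarded("host1"));
    }
}
```

A null check alone would only stop the exception. As the report notes, the RM also needs to learn about the updated request, and a container assigned on a blacklisted host would still need to be handed back to the RM rather than dropped.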