[ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127994#comment-13127994 ]
Hadoop QA commented on MAPREDUCE-2693:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12499113/MR-2693.1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1030//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1030//console

This message is automatically generated.

> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>
>
> The following exception in AM of an application at the top of queue causes
> this. Once this happens, AM keeps obtaining containers from RM and simply
> loses them. Eventually on a cluster with multiple jobs, no more scheduling
> happens because of these lost containers.
> It happens when there are blacklisted nodes at the app level in AM. A bug in
> AM (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this -
> nodes are simply getting removed from the request-table.
> We should make sure RM also knows about this update.
> ========================================================================
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)
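
For readers following the report, the failure mode it describes (a host entry removed from the request table on blacklisting, then a later decrement for a container assigned on that host) can be illustrated with a minimal sketch. This is not the Hadoop code and not the content of MR-2693.1.patch. The class RequestTableSketch and the methods addRequest, decRequest and outstanding are hypothetical. The method name containerFailedOnHost is borrowed from the report, but its body here is only a guess at the behaviour described. The table is simplified to priority -> resource name -> outstanding count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the AM's request table (assumption:
// priority -> resource name -> outstanding container count). Not Hadoop code.
class RequestTableSketch {
    private final Map<Integer, Map<String, Integer>> table = new HashMap<>();

    // Record one more outstanding container request for this resource name.
    public void addRequest(int priority, String resourceName) {
        table.computeIfAbsent(priority, p -> new HashMap<>())
             .merge(resourceName, 1, Integer::sum);
    }

    // Mimics the step described in the report: the blacklisted host is
    // simply dropped from the table at every priority.
    public void containerFailedOnHost(String hostName) {
        for (Map<String, Integer> byName : table.values()) {
            byName.remove(hostName);
        }
    }

    // Null-safe decrement: returns false when the entry is already gone
    // instead of dereferencing a missing value.
    public boolean decRequest(int priority, String resourceName) {
        Map<String, Integer> byName = table.get(priority);
        if (byName == null) {
            return false;
        }
        Integer count = byName.get(resourceName);
        if (count == null) {
            return false; // entry was removed, e.g. by blacklisting
        }
        if (count <= 1) {
            byName.remove(resourceName);
        } else {
            byName.put(resourceName, count - 1);
        }
        return true;
    }

    // Outstanding count for a resource name, 0 if there is no entry.
    public int outstanding(int priority, String resourceName) {
        Map<String, Integer> byName = table.get(priority);
        return byName == null ? 0 : byName.getOrDefault(resourceName, 0);
    }
}
```

In this sketch a decrement after containerFailedOnHost returns false rather than throwing. A caller could use that signal to hand the container back instead of losing it. Whether the real fix takes this shape is not stated in the report.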