NPE in AM causes it to lose containers which are never returned back to RM
--------------------------------------------------------------------------

                 Key: MAPREDUCE-2693
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
            Reporter: Amol Kekre
            Priority: Critical
             Fix For: 0.23.0


The following exception in AM of an application at the top of queue causes 
this. Once this happens, AM keeps obtaining
containers from RM and simply loses them. Eventually on a cluster with multiple 
jobs, no more scheduling happens
because of these lost containers.

It happens when there are blacklisted nodes at the app level in AM. A bug in AM
(RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes 
are simply getting removed from the
request-table. We should make sure RM also knows about this update.

========================================================================
11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 
98.138.163.34
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: 
applicationId=30 priority=20
resourceName=... numContainers=4978 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: 
applicationId=30 priority=20
resourceName=... numContainers=4977 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: 
applicationId=30 priority=20
resourceName=... numContainers=1540 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: 
applicationId=30 priority=20
resourceName=... numContainers=1539 #asks=6
11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM. 
java.lang.NullPointerException
        at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
        at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
        at
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
        at
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
        at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
        at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
        at java.lang.Thread.run(Thread.java:619)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to