[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-2693:
-----------------------------------------------

    Status: Open  (was: Patch Available)

Sorry this took time; it's an involved change. Mostly looks good. A few comments:

RMContainerRequestor:
 - Make the constructor with event-argument invoke the other constructor.
 - {{containerFailedOnHost()}}:
   -- Do we need to remove the rack entries from ask and remoteRequestTable
      also? (The TODO at the end)
   -- Use {{BuilderUtils.newResourceRequest()}} for constructing zeroedRequest.
 - {{getFilteredContainerRequest()}}: Why look for both IP addresses and
   host names when checking whether a node is blacklisted?
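
To make the {{containerFailedOnHost()}} point concrete, here is a rough, self-contained sketch of the idea: when a host is dropped from the request table, also queue a zero-container request so the RM hears about it. SketchRequest, BlacklistSketch and all of their methods are hypothetical stand-ins, not the real YARN records or the code in the patch, and the null-safe decrement is only my assumption about one way to avoid the NPE.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a resource request; NOT the real YARN record.
final class SketchRequest {
    final int priority;
    final String resourceName;
    final int numContainers;

    SketchRequest(int priority, String resourceName, int numContainers) {
        this.priority = priority;
        this.resourceName = resourceName;
        this.numContainers = numContainers;
    }
}

// Hypothetical model of the requestor's bookkeeping.
final class BlacklistSketch {
    // priority -> resource name -> outstanding request (plays the role of
    // remoteRequestTable).
    private final Map<Integer, Map<String, SketchRequest>> requestTable =
        new HashMap<>();
    // Requests to send on the next heartbeat (plays the role of ask). Keyed
    // so that a newer request for the same resource replaces the older one.
    private final Map<String, SketchRequest> ask = new LinkedHashMap<>();

    private static String key(int priority, String resourceName) {
        return priority + "/" + resourceName;
    }

    void addRequest(int priority, String resourceName, int numContainers) {
        SketchRequest req =
            new SketchRequest(priority, resourceName, numContainers);
        requestTable.computeIfAbsent(priority, p -> new HashMap<>())
            .put(resourceName, req);
        ask.put(key(priority, resourceName), req);
    }

    // Drop the host locally AND queue a zeroed request so that the RM stops
    // handing out containers on that host.
    void containerFailedOnHost(String hostName) {
        for (Map.Entry<Integer, Map<String, SketchRequest>> e
                : requestTable.entrySet()) {
            SketchRequest removed = e.getValue().remove(hostName);
            if (removed != null) {
                ask.put(key(e.getKey(), hostName),
                    new SketchRequest(e.getKey(), hostName, 0));
            }
        }
    }

    // Null-safe decrement: returns false instead of throwing when the
    // resource was already removed (for example, because it was blacklisted).
    boolean decResourceRequest(int priority, String resourceName) {
        Map<String, SketchRequest> byName = requestTable.get(priority);
        SketchRequest current =
            (byName == null) ? null : byName.get(resourceName);
        if (current == null) {
            return false;
        }
        SketchRequest updated = new SketchRequest(priority, resourceName,
            Math.max(0, current.numContainers - 1));
        byName.put(resourceName, updated);
        ask.put(key(priority, resourceName), updated);
        return true;
    }

    // Containers that the next heartbeat would ask for, or -1 if nothing is
    // queued for this resource.
    int pendingAskFor(int priority, String resourceName) {
        SketchRequest req = ask.get(key(priority, resourceName));
        return (req == null) ? -1 : req.numContainers;
    }
}
```

In this sketch the rack entry is deliberately left untouched, which is exactly the open question in the TODO above.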

RMContainerAllocator:
 - Can the checks for illegal resource size
   ({{allocated.getResource().getMemory() < mapResourceReqt || maps.isEmpty()}})
   be moved one level up, so that we don't need to repeat them in both
   _assign()_ and _getContainerReqToReplace()_?
 - Log message: "Could not find a valid request to which this allocated 
container maps to". Also add that this container is going to be released?
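
Roughly what I have in mind for hoisting the check, as a minimal sketch. AssignSketch and its members are hypothetical names for illustration only; the real allocator deals with container and task-attempt objects rather than plain strings and ints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the allocator; NOT the code in the patch.
final class AssignSketch {
    private final int mapResourceReqt;
    private final List<String> pendingMaps = new ArrayList<>();
    private final List<String> released = new ArrayList<>();

    AssignSketch(int mapResourceReqt) {
        this.mapResourceReqt = mapResourceReqt;
    }

    void addPendingMap(String taskId) {
        pendingMaps.add(taskId);
    }

    // The single place where the validity check lives. It is the negation of
    // "memory < mapResourceReqt || maps.isEmpty()".
    private boolean isUsable(int allocatedMemory) {
        return allocatedMemory >= mapResourceReqt && !pendingMaps.isEmpty();
    }

    // The caller validates once; the helper below can then assume a usable
    // container and needs no check of its own.
    String handleAllocated(String containerId, int allocatedMemory) {
        if (!isUsable(allocatedMemory)) {
            // Nothing valid maps to this container, so it is released.
            released.add(containerId);
            return null;
        }
        return assign();
    }

    // Assumes the caller has already validated the container.
    private String assign() {
        return pendingMaps.remove(0);
    }

    List<String> released() {
        return released;
    }
}
```

The release branch is also the natural place for the extended log message mentioned above.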

Test: It is not clear to me why we need five iterations in that loop. Is it
possible to make it deterministic or more explicit?

What about currently running tasks: do we want to kill them too if we mark the
node for blacklisting?

General: Wrap lines longer than 80 chars, only those which the patch touches of 
course :)
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>
> The following exception in AM of an application at the top of queue causes
> this. Once this happens, AM keeps obtaining containers from RM and simply
> loses them. Eventually on a cluster with multiple jobs, no more scheduling
> happens because of these lost containers.
> It happens when there are blacklisted nodes at the app level in AM. A bug in
> AM (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this -
> nodes are simply getting removed from the request-table. We should make sure
> RM also knows about this update.
> ========================================================================
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
