[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Robert Joseph Evans (Commented) (JIRA) Thu, 01 Dec 2011 11:31:03 -0800

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161068#comment-13161068 ]


Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

Sid, I think I may have found a bug in the scheduler/MR-AM, but I am not sure 
about it, and I would like your feedback.

When I run the unit test above, I see that the hosts (NMs) are registered with 
the RM using "host:port", but the container requests made in the tests contain 
only "host".  In that case the scheduler seems to indicate that when it assigns 
a container to a host, the assignment is rack local, not data local.  As a 
result, the host-specific request does not seem to be cleared out of the 
scheduler, even though it is not part of the new ask.  If I switch over to 
requesting a container on a particular "host:port", the scheduler finds the 
container to be data local and clears out the host, rack, and * requests.  This 
seems to work OK, but I thought that when we requested a container for data 
locality we used just the host name, because that is what HDFS returns to us.
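
To make the mismatch concrete, here is a minimal sketch of the behavior I am 
describing.  It is not the real scheduler code: LocalityModel, request(), and 
assign() are hypothetical names, and I am assuming the scheduler counts 
outstanding requests per resource name (host, rack, or "*") and matches the 
node name by exact string comparison.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model (NOT the real YARN scheduler): outstanding
// requests are counted per resource name, which is a host, a rack, or "*".
final class LocalityModel {
    static final String ANY = "*";
    final Map<String, Integer> outstanding = new HashMap<>();

    // The AM asks for one container, naming a host, its rack, and "*".
    void request(String host, String rack) {
        outstanding.merge(host, 1, Integer::sum);
        outstanding.merge(rack, 1, Integer::sum);
        outstanding.merge(ANY, 1, Integer::sum);
    }

    // A container is assigned on the node registered as nodeName.
    // Returns the locality level this model would report.
    String assign(String nodeName, String rack) {
        if (outstanding.getOrDefault(nodeName, 0) > 0) {
            // Exact name match: host, rack, and "*" are all cleared.
            decrement(nodeName);
            decrement(rack);
            decrement(ANY);
            return "data-local";
        }
        if (outstanding.getOrDefault(rack, 0) > 0) {
            // Only rack and "*" are cleared; any host entry is left behind.
            decrement(rack);
            decrement(ANY);
            return "rack-local";
        }
        if (outstanding.getOrDefault(ANY, 0) > 0) {
            decrement(ANY);
            return "off-switch";
        }
        return "none";
    }

    private void decrement(String name) {
        // Returning null from the remapping function removes the entry.
        outstanding.computeIfPresent(name, (k, v) -> v > 1 ? v - 1 : null);
    }
}
```

In this model, a request for "h1" followed by an assignment on "h1:1234" is 
reported as rack local and leaves the "h1" entry outstanding, while a request 
for "h1:1234" is reported as data local and leaves nothing behind.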
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager 
> that it has blacklisted, it tries to find a corresponding container request.
> The lookup uses the hostname to find the matching container request, and can 
> end up returning any of the ContainerRequests that may have requested a 
> container on this node. This container request is cleaned to remove the bad 
> node and then added back to the RM 'ask' list.
> The AM clears the 'ask' list after each heartbeat. The RM Allocator is still 
> aware of the priority=5 container (in 'remoteRequestsTable'), but it never 
> gets added back to the 'ask' set, which is what is sent to the RM.
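
The sequence in the quoted description can be sketched roughly as follows. 
This is only an illustration under stated assumptions, not the actual allocator 
code: AskModel, Request, add(), heartbeat(), and handleBlacklistedAllocation() 
are hypothetical names, and the hostname lookup is assumed to return the first 
request that mentions the host, whatever its priority.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the AM-side bookkeeping (NOT the real allocator).
final class AskModel {
    static final class Request {
        final int priority;
        final Set<String> hosts;

        Request(int priority, String... hosts) {
            this.priority = priority;
            this.hosts = new HashSet<>(Arrays.asList(hosts));
        }
    }

    // Everything the AM still believes it needs.
    final List<Request> remoteRequestsTable = new ArrayList<>();
    // What will actually be sent to the RM on the next heartbeat.
    final List<Request> ask = new ArrayList<>();

    void add(Request r) {
        remoteRequestsTable.add(r);
        ask.add(r);
    }

    // Sends the current ask and clears it, as the AM does after each heartbeat.
    List<Request> heartbeat() {
        List<Request> sent = new ArrayList<>(ask);
        ask.clear();
        return sent;
    }

    // A container arrived on a blacklisted host. The lookup is by hostname
    // only, so it may pick a request of a different priority than the one
    // the container was allocated for.
    Request handleBlacklistedAllocation(String host) {
        for (Request r : remoteRequestsTable) {
            if (r.hosts.contains(host)) {
                r.hosts.remove(host); // clean the bad node out of the request
                ask.add(r);           // only this request is asked for again
                return r;
            }
        }
        return null;
    }
}
```

With a priority-20 request and a priority-5 request both naming the same host, 
the lookup in this model returns the priority-20 request.  The priority-5 
request stays in remoteRequestsTable but never re-enters the ask, so later 
heartbeats do not mention it.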

