[jira] [Assigned] (MAPREDUCE-4672) RM with lost NMs results in massive log of AppAttemptId doesnt exist in cache

Vinod Kumar Vavilapalli (JIRA) Mon, 24 Sep 2012 12:19:09 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinod Kumar Vavilapalli reassigned MAPREDUCE-4672:
--------------------------------------------------

    Assignee: Vinod Kumar Vavilapalli

So you killed only the NMs, and the AM is still running? In that case, the RM 
notices that the node running the AM went down, assumes that the AM went down 
with it, and marks the app as invalid.

The RM does ask the AM to shut down as part of the ping response, but it looks 
like your AM isn't handling that correctly.
{code}
    AllocateResponse allocateResponse = recordFactory
        .newRecordInstance(AllocateResponse.class);
    AMResponse lastResponse = responseMap.get(appAttemptId);
    if (lastResponse == null) {
      LOG.error("AppAttemptId doesnt exist in cache " + appAttemptId);
      allocateResponse.setAMResponse(reboot);
      return allocateResponse;
    }
{code}

See {{AMResponse.getReboot()}}
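
For illustration, here is a minimal sketch of how an AM heartbeat loop might react to that flag. It does not use the real YARN classes: {{AmResponse}} and {{RmClient}} are hypothetical stand-ins that model only the reboot flag and the allocate call, and {{runHeartbeats}} is an assumed name. The point is only that the AM should stop calling allocate once the reboot flag is set, instead of pinging the RM forever.

```java
public class HeartbeatLoopSketch {

    /** Hypothetical stand-in for the AMResponse record; only the reboot flag is modeled. */
    public interface AmResponse {
        boolean getReboot();
    }

    /** Hypothetical stand-in for the AM-to-RM allocate (heartbeat) call. */
    public interface RmClient {
        AmResponse allocate();
    }

    /**
     * Sends heartbeats until the RM sets the reboot flag or maxHeartbeats is reached.
     * Returns the number of heartbeats actually sent.
     */
    public static int runHeartbeats(RmClient rm, int maxHeartbeats) {
        int sent = 0;
        while (sent < maxHeartbeats) {
            AmResponse response = rm.allocate();
            sent++;
            if (response.getReboot()) {
                // The RM no longer knows this attempt: stop heartbeating and shut down.
                break;
            }
            // Normal path (omitted): handle allocated containers, then wait for the next interval.
        }
        return sent;
    }

    public static void main(String[] args) {
        // Simulated RM: answers normally twice, then sets the reboot flag from the third call on.
        final int[] calls = {0};
        RmClient fakeRm = () -> {
            calls[0]++;
            final boolean reboot = calls[0] >= 3;
            return () -> reboot;
        };
        int sent = runHeartbeats(fakeRm, 100);
        System.out.println("heartbeats sent before shutdown: " + sent);
    }
}
```

With an AM written this way, the RM would log the error once per lost attempt rather than once per heartbeat.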
                
> RM with lost NMs results in massive log of AppAttemptId doesnt exist in cache
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4672
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4672
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.1
>            Reporter: Chris Riccomini
>            Assignee: Vinod Kumar Vavilapalli
>
> Hey Guys,
> I'm running a 9-node cluster with 8 NMs and a single RM node. If I run an app 
> master and have that app master start a container, then shut down all NMs, 
> but leave the RM up (to simulate a failure), the containers time out and fail, 
> as expected.
> What's unexpected is that my log then starts filling with:
> 2012-09-21 18:02:02,614 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:03,617 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:04,618 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:05,620 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:06,621 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:07,623 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> 2012-09-21 18:02:08,624 ERROR resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:allocate(247)) - AppAttemptId doesnt exist in 
> cache appattempt_1348248013002_0001_000001
> Is there any way to shut this off/fix it? It just keeps going forever, until 
> I bounce the RM node.
> Thanks!
> Chris

