[ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136648#comment-13136648 ]
Eric Payne commented on MAPREDUCE-3186:
---------------------------------------

@Mahadev Would you be able to suggest the best way to unit test this?

Here are the manual tests I performed:

# I ran the following in a single node cluster:
## Start the {{wordcount}} test with a 1.5G file
## Make sure the {{wordcount}} application is in the RUNNING state.
## Stop the RM, HS, and NM daemons
## Restart the RM, HS, and NM daemons
## Check that MRAppMaster java task has exited and log contains the following string: "ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_XXXXXXXXXXXXX_XXXX"
# I ran the following in a single node cluster:
## Start the {{wordcount}} test with a 1.5G file
## Make sure the {{wordcount}} application is in the RUNNING state.
## Stop the RM, HS, and NM daemons
## Wait for the MRAppMaster java task to exit. This may take a few minutes.
## Check that MRAppMaster java task has exited and log contains the following string: "ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Could not contact RM after 10 tries."
## 10 is the default number of retries.
# I ran the following in a single node cluster:
## Edit the yarn-site.xml config file.
## Add the following property:
### name = yarn.app.mapreduce.am.scheduler.connection.retries
### value = 7
## Start the {{wordcount}} test with a 1.5G file
## Make sure the {{wordcount}} application is in the RUNNING state.
## Stop the RM, HS, and NM daemons
## Wait for the MRAppMaster java task to exit. This may take a few minutes.
## Check that MRAppMaster java task has exited and log contains the following string: "ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Could not contact RM after 7 tries."
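
For reference, the property added in the third scenario would look roughly like the fragment below in yarn-site.xml. This is only a sketch that assumes the usual Hadoop {{<property>}} layout; the name and value are the ones listed in the test steps, and in an existing file the {{<property>}} element would go inside the {{<configuration>}} element that is already there.

```xml
<!-- yarn-site.xml: sketch of the retry override used in the third scenario -->
<configuration>
  <!-- Number of attempts the MRAppMaster makes to contact the RM before
       giving up; per the second scenario, the default is 10. -->
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.retries</name>
    <value>7</value>
  </property>
</configuration>
```

With this override in place, the final log message is expected to report 7 tries instead of 10, as described in the last step of that scenario.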

> User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>            Priority: Blocker
>              Labels: test
>         Attachments: MAPREDUCE-3186.v1.txt
>
>
> If the resource manager is restarted while the job execution is in progress, the job is getting hanged.
> UI shows the job as running.
> In the RM log, it is throwing an error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> In the console MRAppMaster and Runjar processes are not getting killed