[ https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Thomason updated MAPREDUCE-7160: ------------------------------------- Description: If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch it but the 'finally' block will not get executed. Additionally, the client service should be stopped if an exception occurs and is not currently. For example, if this error occurs: {code:java} 2018-11-20 12:38:46,173 INFO [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing event TASK_FINISHED 2018-11-20 12:38:46,173 ERROR [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED {code} Then the line {code:java} catch (Throwable t) { LOG.warn("Graceful stop failed. Exiting.. ", t); exitMRAppMaster(1, t); } {code} causes the MRAppMaster to exit before the following finally block can execute. This means that the job end notification is not sent despite [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. Additionally, this finally block should also call 'clientService.stop();', as otherwise the following errors occur on the client side: {code:java} 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 completed successfully 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) ...... 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:244) at org.apache.hadoop.util.RunJar.main(RunJar.java:158) {code} Removing the finally block and adding calls to send the job end notification and stop the client service prior to 'exitMRAppMaster' within the catch block, such as something similar to below. fixes the issue: {code:java} catch (Throwable t) { LOG.warn("Graceful stop failed. Exiting.. ", t); if (isLastAMRetry && notifier != null) { sendJobEndNotify(notifier); } try { Thread.sleep(5000); } catch (InterruptedException e) { e.printStackTrace(); } clientService.stop(); exitMRAppMaster(1, t); } {code} was: If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch it but the 'finally' block will not get executed. Additionally, the client service should be stopped if an exception occurs and is not currently. For example, if this error occurs: {code:java} 2018-11-20 12:38:46,173 INFO [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing event TASK_FINISHED 2018-11-20 12:38:46,173 ERROR [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED {code} Then the line {code:java} catch (Throwable t) { LOG.warn("Graceful stop failed. Exiting.. ", t); exitMRAppMaster(1, t); } {code} causes the MRAppMaster to exit before the following finally block can execute. This means that the job end notification is not sent despite [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. Additionally, this finally block should also call 'clientService.stop();', as otherwise the following errors occur on the client side: {code:java} 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 completed successfully 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) ...... 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:244) at org.apache.hadoop.util.RunJar.main(RunJar.java:158) {code} Moving and adding calls to send the job end notification and stop the client service prior to 'exitMRAppMaster' within the catch block fixes this issue. > Job end notification not sent and client service not stopped after > YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block > ------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: MAPREDUCE-7160 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster > Affects Versions: 3.0.3 > Reporter: John Thomason > Priority: Major > > If a YarnRuntimeException occurs during shutDownJob (in this case, at the > line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block > will catch it but the 'finally' block will not get executed. Additionally, > the client service should be stopped if an exception occurs and is not > currently. > For example, if this error occurs: > {code:java} > 2018-11-20 12:38:46,173 INFO [Thread-171] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, > writing event TASK_FINISHED > 2018-11-20 12:38:46,173 ERROR [Thread-171] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing > History Event: > org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]] > are bad. Aborting... > at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 > 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: > Service JobHistoryEventHandler failed in state STOPPED > {code} > Then the line > {code:java} > catch (Throwable t) { > LOG.warn("Graceful stop failed. Exiting.. ", t); > exitMRAppMaster(1, t); > } > {code} > causes the MRAppMaster to exit before the following finally block can > execute. This means that the job end notification is not sent despite > [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. > Additionally, this finally block should also call 'clientService.stop();', as > otherwise the following errors occur on the client side: > {code:java} > 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% > 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% > 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% > 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 > completed successfully > 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state > is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history > server > 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to > server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to > server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ...... > 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint > configuration is wrong; For more details see: > http://wiki.apache.org/hadoop/UnsetHostnameOrPort at > org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344) > at > org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382) > at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) > at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at > org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) > at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at > org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at > org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at > org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at > org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at > org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at > org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.util.RunJar.run(RunJar.java:244) at > org.apache.hadoop.util.RunJar.main(RunJar.java:158) > {code} > > Removing the finally block and adding calls to send the job end notification > and stop the client service prior to 'exitMRAppMaster' within the catch > block, such as something similar to below. fixes the issue: > {code:java} > catch (Throwable t) { > LOG.warn("Graceful stop failed. Exiting.. ", t); > if (isLastAMRetry && notifier != null) { > sendJobEndNotify(notifier); > } > try { > Thread.sleep(5000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > clientService.stop(); > exitMRAppMaster(1, t); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org