[jira] [Updated] (MAPREDUCE-7160) Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block

John Thomason (JIRA) Mon, 26 Nov 2018 11:43:18 -0800


     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Thomason updated MAPREDUCE-7160:
-------------------------------------
    Description: 
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 
'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch 
it but the 'finally' block will not get executed. Additionally, the client 
service should be stopped if an exception occurs and is not currently.

For example, if this error occurs:
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing 
event TASK_FINISHED 
2018-11-20 12:38:46,173 ERROR [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
History Event: 
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
 are bad. Aborting... 
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
Service JobHistoryEventHandler failed in state STOPPED

{code}
Then the line 
{code:java}
catch (Throwable t) { 
   LOG.warn("Graceful stop failed. Exiting.. ", t); 
   exitMRAppMaster(1, t); 
}

{code}
causes the MRAppMaster to exit before the following finally block can execute. 
This means that the job end notification is not sent despite 
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 

Additionally, this finally block should also call 'clientService.stop();', as 
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
completed successfully 
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is 
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
java.io.IOException: java.net.ConnectException: Your endpoint configuration is 
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort 
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
 at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
 at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
 at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
 

Removing the finally block and adding calls to send the job end notification 
and stop the client service prior to 'exitMRAppMaster' within the catch block, 
such as something similar to below. fixes the issue:
{code:java}
catch (Throwable t) {
   LOG.warn("Graceful stop failed. Exiting.. ", t);
   if (isLastAMRetry && notifier != null) {
      sendJobEndNotify(notifier);
   }
   try {
      Thread.sleep(5000);
   } catch (InterruptedException e) {
      e.printStackTrace();
   }

   clientService.stop();
   exitMRAppMaster(1, t);
}
{code}
 

  was:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 
'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch 
it but the 'finally' block will not get executed. Additionally, the client 
service should be stopped if an exception occurs and is not currently.

For example, if this error occurs:
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing 
event TASK_FINISHED 
2018-11-20 12:38:46,173 ERROR [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
History Event: 
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
 are bad. Aborting... 
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
Service JobHistoryEventHandler failed in state STOPPED

{code}
Then the line 
{code:java}
catch (Throwable t) { 
LOG.warn("Graceful stop failed. Exiting.. ", t); 
exitMRAppMaster(1, t); 
}

{code}
causes the MRAppMaster to exit before the following finally block can execute. 
This means that the job end notification is not sent despite 
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 

Additionally, this finally block should also call 'clientService.stop();', as 
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
completed successfully 
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is 
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
java.io.IOException: java.net.ConnectException: Your endpoint configuration is 
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort 
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
 at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
 at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
 at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
 

Moving and adding calls to send the job end notification and stop the client 
service prior to 'exitMRAppMaster' within the catch block fixes this issue.


> Job end notification not sent and client service not stopped after 
> YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7160
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.3
>            Reporter: John Thomason
>            Priority: Major
>
> If a YarnRuntimeException occurs during shutDownJob (in this case, at the 
> line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block 
> will catch it but the 'finally' block will not get executed. Additionally, 
> the client service should be stopped if an exception occurs and is not 
> currently.
> For example, if this error occurs:
> {code:java}
> 2018-11-20 12:38:46,173 INFO [Thread-171] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, 
> writing event TASK_FINISHED 
> 2018-11-20 12:38:46,173 ERROR [Thread-171] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
> History Event: 
> org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
>  are bad. Aborting... 
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
> 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
> Service JobHistoryEventHandler failed in state STOPPED
> {code}
> Then the line 
> {code:java}
> catch (Throwable t) { 
>    LOG.warn("Graceful stop failed. Exiting.. ", t); 
>    exitMRAppMaster(1, t); 
> }
> {code}
> causes the MRAppMaster to exit before the following finally block can 
> execute. This means that the job end notification is not sent despite 
> [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 
> Additionally, this finally block should also call 'clientService.stop();', as 
> otherwise the following errors occur on the client side:
> {code:java}
> 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
> 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
> 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
> 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
> completed successfully 
> 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state 
> is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history 
> server 
> 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to 
> server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to 
> server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ......
> 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint 
> configuration is wrong; For more details see: 
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort at 
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
>  at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) 
> at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
>  at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
> org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
> org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
> org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>  at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
> org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
> org.apache.hadoop.util.RunJar.main(RunJar.java:158)
> {code}
>  
> Removing the finally block and adding calls to send the job end notification 
> and stop the client service prior to 'exitMRAppMaster' within the catch 
> block, such as something similar to below. fixes the issue:
> {code:java}
> catch (Throwable t) {
>    LOG.warn("Graceful stop failed. Exiting.. ", t);
>    if (isLastAMRetry && notifier != null) {
>       sendJobEndNotify(notifier);
>    }
>    try {
>       Thread.sleep(5000);
>    } catch (InterruptedException e) {
>       e.printStackTrace();
>    }
>    clientService.stop();
>    exitMRAppMaster(1, t);
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (MAPREDUCE-7160) Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block

Reply via email to