[jira] [Updated] (MAPREDUCE-7160) Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block

John Thomason (JIRA) Tue, 20 Nov 2018 17:01:08 -0800


     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Thomason updated MAPREDUCE-7160:
-------------------------------------
    Description: 
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 
'MRAppMaster.this.stop();') in MRAppMaster.java, 
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing 
event TASK_FINISHED 
2018-11-20 12:38:46,173 ERROR [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
History Event: 
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
 are bad. Aborting... 
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
Service JobHistoryEventHandler failed in state STOPPED

{code}
then the line 
{code:java}
catch (Throwable t) { 
LOG.warn("Graceful stop failed. Exiting.. ", t); 
exitMRAppMaster(1, t); 
}

{code}
causes the MRAppMaster to exit before the finally block can execute. This means 
that the job end notification is not sent despite 
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 

Additionally, this finally block should also call 'clientService.stop();', as 
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
completed successfully 
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is 
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
java.io.IOException: java.net.ConnectException: Your endpoint configuration is 
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort 
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
 at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
 at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
 at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
 

Moving and adding calls to send the job end notification and stop the client 
service prior to 'exitMRAppMaster' within the catch block fixes this issue.

  was:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 
'MRAppMaster.this.stop();') in MRAppMaster.java, 
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing 
event TASK_FINISHED 
2018-11-20 12:38:46,173 ERROR [Thread-171] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
History Event: 
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
 are bad. Aborting... 
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
Service JobHistoryEventHandler failed in state STOPPED

{code}
then the line 
{code:java}
catch (Throwable t) { 
LOG.warn("Graceful stop failed. Exiting.. ", t); 
exitMRAppMaster(1, t); 
}

{code}
causes the MRAppMaster to exit before the finally block can execute. This means 
that the job end notification is not sent despite 
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 

Additionally, this finally block should also call 'clientService.stop();', as 
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
completed successfully 
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is 
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 
java.io.IOException: java.net.ConnectException: Your endpoint configuration is 
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort 
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
 at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
 at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
 at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
 

Adding calls to send the job end notification and stop the client service prior 
to 'exitMRAppMaster' within the catch block fixes this issue.


> Job end notification not sent and client service not stopped after 
> YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7160
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.3
>            Reporter: John Thomason
>            Priority: Major
>
> If a YarnRuntimeException occurs during shutDownJob (in this case, at the 
> line 'MRAppMaster.this.stop();') in MRAppMaster.java, 
> {code:java}
> 2018-11-20 12:38:46,173 INFO [Thread-171] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, 
> writing event TASK_FINISHED 
> 2018-11-20 12:38:46,173 ERROR [Thread-171] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing 
> History Event: 
> org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
>  are bad. Aborting... 
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) 
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 
> 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: 
> Service JobHistoryEventHandler failed in state STOPPED
> {code}
> then the line 
> {code:java}
> catch (Throwable t) { 
> LOG.warn("Graceful stop failed. Exiting.. ", t); 
> exitMRAppMaster(1, t); 
> }
> {code}
> causes the MRAppMaster to exit before the finally block can execute. This 
> means that the job end notification is not sent despite 
> [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc]. 
> Additionally, this finally block should also call 'clientService.stop();', as 
> otherwise the following errors occur on the client side:
> {code:java}
> 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
> 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
> 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
> 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 
> completed successfully 
> 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state 
> is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history 
> server 
> 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to 
> server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to 
> server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ......
> 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint 
> configuration is wrong; For more details see: 
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort at 
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
>  at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) 
> at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at 
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
>  at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at 
> org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at 
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at 
> org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at 
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
> org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>  at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at 
> org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.util.RunJar.run(RunJar.java:244) at 
> org.apache.hadoop.util.RunJar.main(RunJar.java:158)
> {code}
>  
> Moving and adding calls to send the job end notification and stop the client 
> service prior to 'exitMRAppMaster' within the catch block fixes this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (MAPREDUCE-7160) Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block

Reply via email to