[
https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Thomason updated MAPREDUCE-7160:
-------------------------------------
Description:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line
'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch
it but the 'finally' block will not get executed.
For example, if this error occurs:
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing
event TASK_FINISHED
2018-11-20 12:38:46,173 ERROR [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
History Event:
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting...
at
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
Service JobHistoryEventHandler failed in state STOPPED
{code}
Then the line
{code:java}
catch (Throwable t) {
LOG.warn("Graceful stop failed. Exiting.. ", t);
exitMRAppMaster(1, t);
}
{code}
causes the MRAppMaster to exit before the following finally block can execute.
This means that the job end notification is not sent despite
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
Additionally, this finally block should also call 'clientService.stop();', as
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
completed successfully
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Your endpoint configuration is
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
Moving and adding calls to send the job end notification and stop the client
service prior to 'exitMRAppMaster' within the catch block fixes this issue.
was:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line
'MRAppMaster.this.stop();') in MRAppMaster.java,
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing
event TASK_FINISHED
2018-11-20 12:38:46,173 ERROR [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
History Event:
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting...
at
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
Service JobHistoryEventHandler failed in state STOPPED
{code}
then the line
{code:java}
catch (Throwable t) {
LOG.warn("Graceful stop failed. Exiting.. ", t);
exitMRAppMaster(1, t);
}
{code}
causes the MRAppMaster to exit before the finally block can execute. This means
that the job end notification is not sent despite
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
Additionally, this finally block should also call 'clientService.stop();', as
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
completed successfully
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Your endpoint configuration is
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
Moving and adding calls to send the job end notification and stop the client
service prior to 'exitMRAppMaster' within the catch block fixes this issue.
> Job end notification not sent and client service not stopped after
> YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-7160
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 3.0.3
> Reporter: John Thomason
> Priority: Major
>
> If a YarnRuntimeException occurs during shutDownJob (in this case, at the
> line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block
> will catch it but the 'finally' block will not get executed.
> For example, if this error occurs:
> {code:java}
> 2018-11-20 12:38:46,173 INFO [Thread-171]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop,
> writing event TASK_FINISHED
> 2018-11-20 12:38:46,173 ERROR [Thread-171]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
> History Event:
> org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
> java.io.IOException: All datanodes
> [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
> are bad. Aborting...
> at
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
> 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
> Service JobHistoryEventHandler failed in state STOPPED
> {code}
> Then the line
> {code:java}
> catch (Throwable t) {
> LOG.warn("Graceful stop failed. Exiting.. ", t);
> exitMRAppMaster(1, t);
> }
> {code}
> causes the MRAppMaster to exit before the following finally block can
> execute. This means that the job end notification is not sent despite
> [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
> Additionally, this finally block should also call 'clientService.stop();', as
> otherwise the following errors occur on the client side:
> {code:java}
> 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
> 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
> 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
> 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
> completed successfully
> 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state
> is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
> server
> 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to
> server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to
> server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ......
> 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint
> configuration is wrong; For more details see:
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort at
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
> at
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
> at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859)
> at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:422) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
> at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
> org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
> org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
> org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
> org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
> org.apache.hadoop.util.RunJar.main(RunJar.java:158)
> {code}
>
> Moving and adding calls to send the job end notification and stop the client
> service prior to 'exitMRAppMaster' within the catch block fixes this issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]