[
https://issues.apache.org/jira/browse/MAPREDUCE-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Thomason updated MAPREDUCE-7160:
-------------------------------------
Description:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line
'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch
it but the 'finally' block will not get executed. Additionally, the client
service should be stopped if an exception occurs and is not currently.
For example, if this error occurs:
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing
event TASK_FINISHED
2018-11-20 12:38:46,173 ERROR [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
History Event:
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting...
at
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
Service JobHistoryEventHandler failed in state STOPPED
{code}
Then the line
{code:java}
catch (Throwable t) {
LOG.warn("Graceful stop failed. Exiting.. ", t);
exitMRAppMaster(1, t);
}
{code}
causes the MRAppMaster to exit before the following finally block can execute.
This means that the job end notification is not sent despite
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
Additionally, this finally block should also call 'clientService.stop();', as
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
completed successfully
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Your endpoint configuration is
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
Moving and adding calls to send the job end notification and stop the client
service prior to 'exitMRAppMaster' within the catch block fixes this issue.
was:
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line
'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch
it but the 'finally' block will not get executed.
For example, if this error occurs:
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing
event TASK_FINISHED
2018-11-20 12:38:46,173 ERROR [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
History Event:
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting...
at
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
Service JobHistoryEventHandler failed in state STOPPED
{code}
Then the line
{code:java}
catch (Throwable t) {
LOG.warn("Graceful stop failed. Exiting.. ", t);
exitMRAppMaster(1, t);
}
{code}
causes the MRAppMaster to exit before the following finally block can execute.
This means that the job end notification is not sent despite
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
Additionally, this finally block should also call 'clientService.stop();', as
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
completed successfully
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Your endpoint configuration is
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
Moving and adding calls to send the job end notification and stop the client
service prior to 'exitMRAppMaster' within the catch block fixes this issue.
> Job end notification not sent and client service not stopped after
> YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-7160
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 3.0.3
> Reporter: John Thomason
> Priority: Major
>
> If a YarnRuntimeException occurs during shutDownJob (in this case, at the
> line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block
> will catch it but the 'finally' block will not get executed. Additionally,
> the client service should be stopped if an exception occurs and is not
> currently.
> For example, if this error occurs:
> {code:java}
> 2018-11-20 12:38:46,173 INFO [Thread-171]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop,
> writing event TASK_FINISHED
> 2018-11-20 12:38:46,173 ERROR [Thread-171]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
> History Event:
> org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
> java.io.IOException: All datanodes
> [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
> are bad. Aborting...
> at
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
> 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
> Service JobHistoryEventHandler failed in state STOPPED
> {code}
> Then the line
> {code:java}
> catch (Throwable t) {
> LOG.warn("Graceful stop failed. Exiting.. ", t);
> exitMRAppMaster(1, t);
> }
> {code}
> causes the MRAppMaster to exit before the following finally block can
> execute. This means that the job end notification is not sent despite
> [https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
> Additionally, this finally block should also call 'clientService.stop();', as
> otherwise the following errors occur on the client side:
> {code:java}
> 2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
> 2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
> 2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
> 2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
> completed successfully
> 2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state
> is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
> server
> 2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to
> server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to
> server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ......
> 2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint
> configuration is wrong; For more details see:
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort at
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
> at
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
> at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859)
> at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:422) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
> at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
> org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
> org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
> org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
> org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
> org.apache.hadoop.util.RunJar.main(RunJar.java:158)
> {code}
>
> Moving and adding calls to send the job end notification and stop the client
> service prior to 'exitMRAppMaster' within the catch block fixes this issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]