John Thomason created MAPREDUCE-7160:
----------------------------------------
Summary: Job end notification not sent and client service not
stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java
Key: MAPREDUCE-7160
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7160
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster
Affects Versions: 3.0.3
Reporter: John Thomason
If a YarnRuntimeException occurs during shutDownJob (in this case, at the line
'MRAppMaster.this.stop();') in MRAppMaster.java,
{code:java}
2018-11-20 12:38:46,173 INFO [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing
event TASK_FINISHED
2018-11-20 12:38:46,173 ERROR [Thread-171]
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing
History Event:
org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5
java.io.IOException: All datanodes
[DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]]
are bad. Aborting...
at
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20
12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService:
Service JobHistoryEventHandler failed in state STOPPED
{code}
then the line
{code:java}
catch (Throwable t) {
LOG.warn("Graceful stop failed. Exiting.. ", t);
exitMRAppMaster(1, t);
}
{code}
causes the MRAppMaster to exit before the finally block can execute. This means
that the job end notification is not sent despite
[https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc].
Additionally, this finally block should also call 'clientService.stop();', as
otherwise the following errors occur on the client side:
{code:java}
2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98%
2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99%
2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100%
2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001
completed successfully
2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Your endpoint configuration is
wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382)
at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at
org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at
org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at
org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.util.RunJar.run(RunJar.java:244) at
org.apache.hadoop.util.RunJar.main(RunJar.java:158)
{code}
Adding calls to send the job end notification and stop the client service prior
to ' exitMRAppMaster' within the catch block fixes this issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]