[ https://issues.apache.org/jira/browse/MAPREDUCE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jian He updated MAPREDUCE-5488:
-------------------------------
    Attachment: MAPREDUCE-5488.patch

Uploaded a patch: if the failure is an AM connection failure, the retry count is not decremented, because the AM will be restarted later. If the job has really failed and the AM will not be restarted, JobClient will query the RM for the final status.

> Job recovery fails after killing all the running containers for the app
> -----------------------------------------------------------------------
>
> Key: MAPREDUCE-5488
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5488
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.1.0-beta
> Reporter: Arpit Gupta
> Assignee: Jian He
> Attachments: MAPREDUCE-5488.patch
>
>
> Here is the client stack trace
> {code}
> RUNNING: /usr/lib/hadoop/bin/hadoop jar
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.1.0.2.0.5.0-66.jar
> wordcount "-Dmapreduce.reduce.input.limit=-1"
> /user/user/test_yarn_ha/medium_wordcount_input
> /user/hrt_qa/test_yarn_ha/test_mapred_ha_single_job_applicationmaster-1-time
> 13/08/30 08:45:39 INFO client.RMProxy: Connecting to ResourceManager at
> hostname/68.142.247.148:8032
> 13/08/30 08:45:40 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 19
> for user on ha-hdfs:ha-2-secure
> 13/08/30 08:45:40 INFO security.TokenCache: Got dt for hdfs://ha-2-secure;
> Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ha-2-secure, Ident:
> (HDFS_DELEGATION_TOKEN token 19 for user)
> 13/08/30 08:45:40 INFO input.FileInputFormat: Total input paths to process :
> 20
> 13/08/30 08:45:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 13/08/30 08:45:40 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
> 13/08/30 08:45:40 INFO mapreduce.JobSubmitter: number of splits:180
> 13/08/30 08:45:40 WARN conf.Configuration: user.name is deprecated. Instead,
> use mapreduce.job.user.name
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.jar is deprecated. Instead,
> use mapreduce.job.jar
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.output.value.class is
> deprecated. Instead, use mapreduce.job.output.value.class
> 13/08/30 08:45:40 WARN conf.Configuration: mapreduce.combine.class is
> deprecated. Instead, use mapreduce.job.combine.class
> 13/08/30 08:45:40 WARN conf.Configuration: mapreduce.map.class is deprecated.
> Instead, use mapreduce.job.map.class
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.job.name is deprecated.
> Instead, use mapreduce.job.name
> 13/08/30 08:45:40 WARN conf.Configuration: mapreduce.reduce.class is
> deprecated. Instead, use mapreduce.job.reduce.class
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.input.dir is deprecated.
> Instead, use mapreduce.input.fileinputformat.inputdir
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.output.dir is deprecated.
> Instead, use mapreduce.output.fileoutputformat.outputdir
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.map.tasks is deprecated.
> Instead, use mapreduce.job.maps
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.output.key.class is
> deprecated. Instead, use mapreduce.job.output.key.class
> 13/08/30 08:45:40 WARN conf.Configuration: mapred.working.dir is deprecated.
> Instead, use mapreduce.job.working.dir
> 13/08/30 08:45:41 INFO mapreduce.JobSubmitter: Submitting tokens for job:
> job_1377851032086_0003
> 13/08/30 08:45:41 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN,
> Service: ha-hdfs:ha-2-secure, Ident: (HDFS_DELEGATION_TOKEN token 19 for user)
> 13/08/30 08:45:42 INFO impl.YarnClientImpl: Submitted application
> application_1377851032086_0003 to ResourceManager at
> hostname/68.142.247.148:8032
> 13/08/30 08:45:42 INFO mapreduce.Job: The url to track the job:
> http://hostname:8088/proxy/application_1377851032086_0003/
> 13/08/30 08:45:42 INFO mapreduce.Job: Running job: job_1377851032086_0003
> 13/08/30 08:45:48 INFO mapreduce.Job: Job job_1377851032086_0003 running in
> uber mode : false
> 13/08/30 08:45:48 INFO mapreduce.Job: map 0% reduce 0%
> stop applicationmaster
> beaver.component.hadoop|INFO|Kill container
> container_1377851032086_0003_01_000001 on host hostname
> RUNNING: ssh -o StrictHostKeyChecking=no hostname "sudo su - -c \"ps aux |
> grep container_1377851032086_0003_01_000001 | awk '{print \\\$2}' | xargs
> kill -9\" root"
> Warning: Permanently added 'hostname,68.142.247.155' (RSA) to the list of
> known hosts.
> kill 8978: No such process
> waiting for down time 10 seconds for service applicationmaster
> 13/08/30 08:45:55 INFO ipc.Client: Retrying connect to server:
> hostname/68.142.247.155:52713. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
> 13/08/30 08:45:56 INFO ipc.Client: Retrying connect to server:
> hostname/68.142.247.155:52713. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
> 13/08/30 08:45:56 ERROR security.UserGroupInformation:
> PriviledgedActionException as:user@REALM (auth:KERBEROS)
> cause:java.io.IOException: java.net.ConnectException: Call From
> hostname.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> java.io.IOException: java.net.ConnectException: Call From
> hostname.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> at
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:319)
> at
> org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:354)
> at
> org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:529)
> at org.apache.hadoop.mapreduce.Job$5.run(Job.java:668)
> at org.apache.hadoop.mapreduce.Job$5.run(Job.java:665)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
> at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:665)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1349)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
> at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.net.ConnectException: Call From hostname.ConnectException:
> Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> at org.apache.hadoop.ipc.Client.call(Client.java:1351)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at $Proxy14.getTaskAttemptCompletionEvents(Unknown Source)
> at
> org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:177)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:310)
> ... 23 more
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
> at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
> at org.apache.hadoop.ipc.Client.call(Client.java:1318)
> ... 32 more
> {code}
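
For readers following the patch comment at the top of this message, here is a rough Java sketch of the retry behaviour it describes. It is not the code in MAPREDUCE-5488.patch: the class `AmRetrySketch`, the interfaces `AmCall` and `RmQuery`, and the method `invoke` are hypothetical stand-ins, and the sketch assumes the connection failure surfaces directly as a `java.net.ConnectException` (in the trace above it arrives wrapped in an `IOException`). A refused AM connection does not consume a retry; the client asks the RM whether the application has finished and otherwise waits for the AM to come back.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical sketch; these names are not taken from MAPREDUCE-5488.patch. */
class AmRetrySketch {

    /** Stand-in for one RPC to the ApplicationMaster (AM). */
    interface AmCall<T> {
        T call() throws IOException;
    }

    /** Stand-in for asking the ResourceManager (RM) about the application. */
    interface RmQuery<T> {
        /** Returns the final status if the application has finished, else null. */
        T finalStatusOrNull() throws IOException;
    }

    static <T> T invoke(AmCall<T> am, RmQuery<T> rm, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        int retriesLeft = maxRetries;
        IOException lastFailure = null;
        while (retriesLeft > 0) {
            try {
                return am.call();
            } catch (ConnectException e) {
                // AM unreachable: it may be restarting, so the retry budget is left untouched.
                T finalStatus = rm.finalStatusOrNull();
                if (finalStatus != null) {
                    return finalStatus; // the job is over; the AM will not come back
                }
                Thread.sleep(sleepMillis); // wait for the next AM attempt
            } catch (IOException e) {
                // Any other failure consumes one retry.
                lastFailure = e;
                retriesLeft--;
            }
        }
        throw new IOException("Retries exhausted", lastFailure);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The AM refuses three connections, then answers; the RM reports "not finished".
        String status = AmRetrySketch.<String>invoke(
            () -> {
                if (calls[0]++ < 3) {
                    throw new ConnectException("Connection refused");
                }
                return "RUNNING";
            },
            () -> null, 1, 10L);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```

With `maxRetries=1`, as in the retry policy logged above, the call in `main` still succeeds after several refused connections, because only non-connection failures count against the budget. The loop relies on the RM eventually reporting a final status if the AM never returns.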