[ https://issues.apache.org/jira/browse/MAPREDUCE-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635151#comment-14635151 ]
Jason Lowe commented on MAPREDUCE-6437:
---------------------------------------
There are a couple of problems here. The first is that HDFS is the layer that
should have retried; it seems backwards for clients to have to wrap HDFS calls
in retries. The second is that the output committer is pluggable, and there's
no indication of whether commits are safe to retry. That is being tracked by
MAPREDUCE-5485.

Therefore this seems to be either a bug against HDFS, if it didn't retry
sufficiently in the HDFS layer, or a duplicate of MAPREDUCE-5485.
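
A minimal sketch of the second point, in plain Java and independent of Hadoop:
a retry on a connection exception is only reasonable when the caller can state
that the commit is safe to repeat. `CommitRetry`, `CommitAction`, and the
`repeatable` flag are hypothetical names used for illustration and are not
Hadoop APIs; how a real committer would advertise repeatability is the open
question tracked by MAPREDUCE-5485.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Illustrative only: none of these names come from Hadoop. */
final class CommitRetry {

    /** Hypothetical stand-in for a commit step that may fail with an I/O error. */
    @FunctionalInterface
    interface CommitAction<T> {
        T run() throws IOException;
    }

    private CommitRetry() {}

    /**
     * Runs the action and retries on ConnectException, but only when the
     * caller declares the action repeatable. A non-repeatable action gets
     * exactly one attempt and its exception is rethrown unchanged.
     */
    static <T> T runWithRetry(CommitAction<T> action, boolean repeatable,
                              int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int allowed = repeatable ? maxAttempts : 1;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.run();
            } catch (ConnectException e) {
                if (attempt >= allowed) {
                    throw e; // out of attempts, or retrying is not known to be safe
                }
                Thread.sleep(backoffMillis * attempt); // simple linear backoff
            }
        }
    }
}
```

Other `IOException` subtypes are deliberately not retried here, since only the
connection failure in the quoted trace is under discussion.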
> Add retry on some connection exception on job commit phase
> ----------------------------------------------------------
>
> Key: MAPREDUCE-6437
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6437
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bing Jiang
>
> {code}
> Job commit failed: java.net.ConnectException: Call From TS-DN-167/172.22.5.167 to SHYF-H11-BH03:52310 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy14.create(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:287)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy15.create(Unknown Source)
> at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1645)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1627)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1552)
> at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:396)
> at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:392)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:392)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:336)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
> at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.touchz(CommitterEventHandler.java:244)
> at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:250)
> at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection timed out
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> ... 28 more
> {code}
> From reading the code, there is no chance of another application master
> attempt once the job commit hits this connection error. Could we identify the
> exception and either retry the commit or kick off another AM attempt?