[ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492938 ]
Hairong Kuang commented on HADOOP-1263:
---------------------------------------

> I think the create() and cleanup() RPCs can be retried too.

I believe that Dhruba meant create() and complete(). +1 on his suggestion.

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and
> require supporting files from filecache, some map tasks fail because of rpc
> timeouts. With only the default number of 10 handlers on the namenode, the
> probability is high that the whole job fails (see HADOOP-1182). It is much
> better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients retried on a timeout before throwing
> an exception.
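
The retry behaviour requested above can be sketched roughly as follows. This is
only an illustration, not the patch for this issue: the class name
RetryOnTimeout, the method call(), and the maxAttempts and sleepMillis
parameters are all hypothetical, and the sketch assumes the wrapped operation
is safe to repeat (as exists() and open() are).

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: retries an operation that is
// assumed to be idempotent when it fails with a socket timeout.
final class RetryOnTimeout {
    private RetryOnTimeout() {}

    /**
     * Runs op up to maxAttempts times. Only SocketTimeoutException triggers
     * a retry; any other exception propagates immediately. If every attempt
     * times out, the last timeout is rethrown.
     */
    static <T> T call(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SocketTimeoutException e) {
                last = e;
                // Pause between attempts, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap a single RPC, for example
RetryOnTimeout.call(() -> client.exists(path), 3, 1000L), where client and
path stand in for whatever the caller already has. A fixed sleep is used here
for brevity; a backoff that grows between attempts may suit a heavily loaded
namenode better.
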
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>         at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>         at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>         at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.