[jira] Commented: (HADOOP-1263) retry logic when dfs exist or open fails temporarily, e.g because of timeout

Hairong Kuang (JIRA) Tue, 01 May 2007 13:30:36 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492938 ]


Hairong Kuang commented on HADOOP-1263:
---------------------------------------

> I think the create() and cleanup() RPCs can be retried too.
I believe that Dhruba meant create() and complete(). +1 on his suggestion.
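
For illustration only, here is a minimal sketch of the kind of client-side retry being discussed. It is not the patch for this issue: RpcRetry, callWithRetry, and the attempt and backoff parameters are hypothetical names, and the sketch assumes that only a SocketTimeoutException is worth retrying and that repeating the call is harmless.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class RpcRetry {
    private RpcRetry() {}

    /**
     * Invokes the call, retrying only when it fails with SocketTimeoutException.
     * Once maxAttempts calls have timed out, the last timeout is rethrown.
     * Any other exception is propagated immediately without a retry.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long sleep = backoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the timeout to the caller
                }
                Thread.sleep(sleep);
                sleep *= 2; // back off so retries do not pile onto a busy namenode
            }
        }
    }
}
```

A caller could wrap a read-only call such as exists() like this, where fs and path stand for the caller's own FileSystem and Path: RpcRetry.callWithRetry(() -> fs.exists(path), 3, 1000L). For create() and complete() the same wrapper is only safe if the namenode tolerates receiving the same request twice.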

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and
> require supporting files from the filecache, some map tasks fail because of
> RPC timeouts. With only the default of 10 handlers on the namenode, the
> probability is high that the whole job fails (see HADOOP-1182). A higher
> number of handlers helps considerably, but some map tasks still fail.
> This could be avoided if RPC clients retried after a timeout before
> throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>         at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>         at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>         at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
