[ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493136 ]

Tom White commented on HADOOP-1263:
-----------------------------------

It is worth reviewing the logging, especially the combination of log level and 
retry count. In particular, RetryInvocationHandler always logs exceptions at 
warn level, which is probably wrong: if the operation is going to be retried, 
log at info, and only log at warn when the operation will not be retried again. 
(Also, it might be nice to change the last log message to say how many times 
the operation was retried before it finally failed.)
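The suggested policy can be sketched in isolation. This is not Hadoop's actual RetryInvocationHandler; the class and method names below are illustrative, and only the level-selection and final-message logic from the comment is shown:

```java
// Hedged sketch of the proposed logging policy: INFO while the operation
// will be retried, WARN only on the final, non-retried failure, with the
// final message reporting the retry count. Names are hypothetical.
public class RetryLogging {

    // Choose the log level for a failed attempt: INFO while retries
    // remain, WARN once the operation will not be retried again.
    static String levelFor(int attempt, int maxRetries) {
        return attempt < maxRetries ? "INFO" : "WARN";
    }

    // Final failure message that says how many times the call was retried.
    static String finalMessage(String method, int maxRetries) {
        return method + " failed after " + maxRetries + " retries";
    }

    public static void main(String[] args) {
        int maxRetries = 3;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            System.out.println(levelFor(attempt, maxRetries)
                    + ": attempt " + attempt + " failed");
        }
        System.out.println("WARN: " + finalMessage("exists", maxRetries));
    }
}
```

With maxRetries = 3, attempts 1 and 2 log at INFO and only the third, final attempt logs at WARN.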

Also, could you add a unit test for ExponentialBackoffRetry please?

I notice that ExponentialBackoffRetry uses a delay that incorporates a random 
number. This is probably a good idea to avoid the problem you are trying to 
fix, but since the other policies are deterministic I would make this 
difference clear in the name: e.g. FuzzyExponentialBackoffRetry.
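The randomized delay might look something like the following. This is a sketch under assumption, not the patch's actual ExponentialBackoffRetry: the method name, base delay, and jitter range are all hypothetical, and it only illustrates why a "fuzzy" delay spreads out simultaneous retries:

```java
import java.util.Random;

public class FuzzyBackoff {

    // Hypothetical randomized exponential backoff: the deterministic part
    // doubles with each retry (baseMillis * 2^retries), and a random factor
    // in [0.5, 1.5) scatters clients so they do not retry in lockstep.
    static long delayMillis(int retries, long baseMillis, Random rng) {
        long exp = baseMillis * (1L << retries);        // deterministic part
        return (long) (exp * (0.5 + rng.nextDouble())); // jitter in [0.5, 1.5)
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int r = 0; r < 4; r++) {
            System.out.println("retry " + r + ": "
                    + delayMillis(r, 100, rng) + " ms");
        }
    }
}
```

A unit test for such a policy cannot assert exact delays, but it can assert bounds: for retry r with base b, the delay must fall in [0.5 * b * 2^r, 1.5 * b * 2^r).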

> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>         Assigned To: Hairong Kuang
>         Attachments: retry.patch
>
>
> Sometimes, when many (e.g. 1000+) map tasks start at about the same time and 
> require supporting files from the filecache, some of them fail because of RPC 
> timeouts. With only the default number of 10 handlers on the namenode, the 
> probability is high that the whole job fails (see HADOOP-1182). It is much 
> better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if RPC clients retried on timeout before throwing an 
> exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.