scalability issue with filecache in large clusters
--------------------------------------------------
Key: HADOOP-1182
URL: https://issues.apache.org/jira/browse/HADOOP-1182
Project: Hadoop
Issue Type: Bug
Components: mapred
Affects Versions: 0.12.1
Reporter: Christian Kunz
When the filecache is used to distribute supporting files for map/reduce
applications on a 1000-node cluster, many map tasks fail with timeouts. The
same applications, with comparable input data, ran without this problem on a
200-node cluster. Either the whole job fails because too many map tasks fail
or, even worse, some map tasks hang indefinitely.
java.net.SocketTimeoutException: timed out waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:473)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
    at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
    at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
    at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
    at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
    at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
    at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
    at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
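
To give a rough sense of why this might hurt more at 1000 nodes than at 200:
the trace shows that the freshness check made when a task starts
(getLocalCache -> ifExistsAndFresh -> createMD5) ends in an RPC to the
namenode. The sketch below is a back-of-the-envelope model only, not Hadoop
code. It assumes one such RPC per cached file per task start; the class name,
the method name, and the tasks-per-node and file counts are all hypothetical.

```java
// Hypothetical model, not part of Hadoop: estimates how many namenode RPCs
// the per-task freshness checks would generate for a given cluster size.
final class FreshnessCheckModel {

    /**
     * Namenode RPCs if every task start checks every cached file once.
     * Assumption: exactly one RPC per (task, cached file) pair.
     */
    static long rpcsPerTaskCheck(int nodes, int tasksPerNode, int cachedFiles) {
        return (long) nodes * tasksPerNode * cachedFiles;
    }

    public static void main(String[] args) {
        int tasksPerNode = 10; // assumed value, not taken from the report
        int cachedFiles = 5;   // assumed value, not taken from the report
        for (int nodes : new int[] {200, 1000}) {
            System.out.printf("%d nodes: %d freshness-check RPCs%n",
                    nodes, rpcsPerTaskCheck(nodes, tasksPerNode, cachedFiles));
        }
    }
}
```

Under these assumptions the load on the single namenode grows linearly with
the number of nodes, so a 1000-node cluster issues five times as many
freshness-check RPCs as a 200-node cluster running the same job.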