[jira] Commented: (HADOOP-1182) scalability issue with filecache in large clusters

dhruba borthakur (JIRA) Thu, 29 Mar 2007 09:49:46 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485267 ]


dhruba borthakur commented on HADOOP-1182:
------------------------------------------

It would help us a lot if you could monitor the CPU usage on the namenode
while this is happening. Was the CPU close to 100%? Also, do you see messages
of the following type in the namenode log?

"Call queue overflow discarding oldest call"
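
One way to check the log for this message is sketched below. This is only a minimal sketch, not part of Hadoop: the function names are made up for illustration, the log path is something you would supply yourself, and the only string taken from this thread is the message text itself. It simply counts the lines that contain the message.

```python
from typing import Iterable

# Message text quoted in the comment above. Everything else in this
# sketch (function names, how the log is read) is an assumption.
OVERFLOW_MESSAGE = "Call queue overflow discarding oldest call"


def count_overflow_messages(lines: Iterable[str]) -> int:
    """Return how many of the given log lines contain the overflow message."""
    return sum(1 for line in lines if OVERFLOW_MESSAGE in line)


def count_in_file(path: str) -> int:
    """Count overflow messages in the log file at a caller-supplied path."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, "r", errors="replace") as handle:
        return count_overflow_messages(handle)
```

Calling count_in_file() with the location of your namenode log (the path depends on your installation) returns the number of matching lines. A non-zero count while the map tasks are timing out would suggest the namenode is dropping RPC calls under load.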

> scalability issue with filecache in large clusters
> --------------------------------------------------
>
>                 Key: HADOOP-1182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1182
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.1
>            Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce 
> applications in a 1000 node cluster, many map tasks fail because of 
> timeouts. There was no such problem using a 200 node cluster for the same 
> applications with comparable input data. Either the whole job fails because 
> of too many map failures, or even worse, some map tasks hang indefinitely.
>
> java.net.SocketTimeoutException: timed out waiting for rpc response
>       at org.apache.hadoop.ipc.Client.call(Client.java:473)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>       at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
>       at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
>       at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
>       at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
>       at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>       at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>       at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>       at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
>       at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
>       at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
>       at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
