[ https://issues.apache.org/jira/browse/HADOOP-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528613 ]
Christian Kunz commented on HADOOP-1874:
----------------------------------------

We ran the job again with lazy-dfs-ops.2.patch and server-throttle-hack.patch, and it failed again, mainly because the namenode could not handle the load when 2600 reduces closed about 100 files each, taking longer than 20 minutes to close.

Related exceptions are (illustrative sketches follow below the quoted issue):

on the client side:

07/09/18 22:39:41 INFO fs.DFSClient: Could not complete file, retrying...
07/09/18 22:40:18 INFO fs.DFSClient: Could not complete file, retrying...

on the namenode side:

2007-09-18 22:54:22,727 WARN org.apache.hadoop.ipc.Server: IPC Server handler 15 on 8600, call renewLease(DFSClient_-1403982964) from <ipaddress>:59329: output error
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:628)

> lost task trackers -- jobs hang
> -------------------------------
>
>                 Key: HADOOP-1874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: lazy-dfs-ops.1.patch, lazy-dfs-ops.2.patch, lazy-dfs-ops.4.patch, lazy-dfs-ops.patch, server-throttle-hack.patch
>
>
> This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763 (which fixes a previous 'lost task tracker' issue), running a c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get lost in high numbers towards the end of the job.
> Similar non-pipes jobs do not show the same problem, but it is unclear whether this is related to c++-pipes. It could also be dfs overload when the reduce tasks close and validate all newly created dfs files. I see dfs client rpc timeout exceptions, but this alone does not explain the escalation in lost task trackers.
> I also noticed that the job tracker becomes rather unresponsive, with rpc timeout and call queue overflow exceptions. The job tracker is running with 60 handlers.
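For context on the client-side log lines: in this era of the DFS client, closing an output file appears to reduce to asking the namenode to "complete" the file and polling until it agrees, which is what produces "Could not complete file, retrying...". The sketch below is a minimal illustration of that pattern, not the actual DFSClient code; the NamenodeStub interface, the closeFile helper, and the 400 ms poll interval are assumptions made purely for illustration.

    import java.io.IOException;

    // Minimal sketch of the close-path polling behind
    // "Could not complete file, retrying..." -- NOT the actual DFSClient source.
    // NamenodeStub and the 400 ms poll interval are illustrative assumptions.
    public class CompleteFileRetrySketch {

        /** Hypothetical stand-in for the client-to-namenode "complete file" RPC. */
        interface NamenodeStub {
            /** Returns true once the namenode considers the file's blocks complete. */
            boolean complete(String src, String clientName) throws IOException;
        }

        static void closeFile(NamenodeStub namenode, String src, String clientName)
                throws IOException, InterruptedException {
            while (!namenode.complete(src, clientName)) {
                // Every retry is another RPC to the namenode. With ~2600 reducers
                // each closing ~100 files, this polling alone adds a very large
                // number of calls on top of the normal block-report traffic.
                System.out.println("Could not complete file, retrying...");
                Thread.sleep(400); // assumed backoff between polls
            }
        }
    }

Under that assumption, the 20+ minute close time is unsurprising: every reducer keeps polling the same single namenode until the block reports for its last files arrive.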
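The namenode-side stack trace looks like the mirror image of the same overload: by the time IPC handler 15 gets around to writing the renewLease response, the connection has apparently already been torn down (the caller's RPC presumably timed out and the connection was closed), so the write fails with ClosedChannelException. The snippet below reproduces just that failure mode, using a java.nio Pipe instead of a real socket to stay self-contained; it illustrates the exception, not Hadoop's IPC code.

    import java.nio.ByteBuffer;
    import java.nio.channels.ClosedChannelException;
    import java.nio.channels.Pipe;

    // Self-contained illustration of the "output error" WARN in the namenode log:
    // writing a late response to a channel that has already been closed.
    // A java.nio Pipe stands in for the server's socket channel here.
    public class LateResponseSketch {
        public static void main(String[] args) throws Exception {
            Pipe pipe = Pipe.open();
            Pipe.SinkChannel responseChannel = pipe.sink();

            // Connection torn down first, e.g. after the caller's RPC timeout.
            responseChannel.close();

            try {
                // The overloaded handler thread only now finishes the call
                // and tries to flush the response -- too late.
                responseChannel.write(ByteBuffer.wrap("renewLease response".getBytes()));
            } catch (ClosedChannelException e) {
                System.out.println("output error: " + e); // what the server logs as WARN
            }
        }
    }

The warning itself is harmless, but it is symptomatic: handlers are taking so long that callers give up before the response is written, which matches the rpc timeout exceptions seen on the clients.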