[ https://issues.apache.org/jira/browse/HADOOP-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod K V updated HADOOP-5285:
------------------------------

    Attachment: trace.txt

After recovering from the hang, the JT logged the following; it looks like initTasks() was blocked on a DFS read:
{code}
2009-02-19 09:52:53,132 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200902190419_0129_r_000178_0' has completed task_200902190419_0129_r_000178 successfully.
2009-02-19 10:03:29,445 WARN org.apache.hadoop.hdfs.DFSClient: Exception while reading from blk_2044238107768440002_840946 of /mapredsystem/hadoopqa/mapredsystem/job_200902190419_0419/job.split from <host:port>: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<host:port> remote=/<host:port>]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1222)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1075)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1630)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1680)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:154)
    at org.apache.hadoop.mapred.JobClient$RawSplit.readFields(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient.readSplitFile(JobClient.java:1060)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:402)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$JobInitThread.run(EagerTaskInitializationListener.java:55)
    at java.lang.Thread.run(Thread.java:619)
2009-02-19 10:03:29,446 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2044238107768440002_840946 from any node: java.io.IOException: No live nodes contain current block
{code}
Also attaching the stack trace of the JT (trace.txt) taken while one such hang was in progress.

> JobTracker hangs for long periods of time
> -----------------------------------------
>
>                 Key: HADOOP-5285
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5285
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: trace.txt
>
>
> On one of the larger clusters of 2000 nodes, the JT hung quite often, sometimes for periods on the order of 10-15 minutes and once for one and a half hours(!).
> The stack trace shows that JobInProgress.obtainTaskCleanupTask() is waiting for the lock on the JobInProgress object, which JobInProgress.initTasks() holds for a long time while waiting on DFS operations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
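The contention described above boils down to one thread holding the JobInProgress monitor while blocked on a remote DFS read, so every other synchronized method on the same object (obtainTaskCleanupTask() among them) has to wait. The standalone Java sketch below only illustrates that pattern; the class and method names merely mimic the ones in the report, and a sleep stands in for the stalled DFSClient read, so this is not the actual Hadoop code.
{code:java}
// Minimal, self-contained illustration of the lock contention described in this issue.
// NOT the actual JobInProgress code: the names and the simulated DFS read are placeholders.
public class LockContentionSketch {

    static class JobInProgressLike {
        // Holds the object monitor for the entire duration of the simulated DFS read.
        synchronized void initTasks() throws InterruptedException {
            System.out.println("initTasks: reading job.split from DFS (simulated)...");
            Thread.sleep(10_000); // stands in for a DFSClient read stuck on an unresponsive datanode
            System.out.println("initTasks: done");
        }

        // Cannot run until initTasks() releases the monitor.
        synchronized void obtainTaskCleanupTask() {
            System.out.println("obtainTaskCleanupTask: finally got the lock");
        }
    }

    public static void main(String[] args) throws Exception {
        JobInProgressLike jip = new JobInProgressLike();

        Thread init = new Thread(() -> {
            try { jip.initTasks(); } catch (InterruptedException ignored) { }
        }, "init-thread");
        Thread heartbeat = new Thread(jip::obtainTaskCleanupTask, "heartbeat-thread");

        init.start();
        Thread.sleep(100);   // let initTasks() acquire the monitor first
        heartbeat.start();   // this thread now blocks for ~10 seconds on the same monitor
        init.join();
        heartbeat.join();
    }
}
{code}
In the sketch, performing the slow read before entering (or outside of) the synchronized section would let obtainTaskCleanupTask() proceed immediately; keeping remote I/O out of the critical section is, in general, one way heartbeat handling could avoid stalling behind job initialization.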