[ https://issues.apache.org/jira/browse/HADOOP-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod K V updated HADOOP-5285:
------------------------------

    Attachment: trace.txt

After recovering from the hang, the JT logged the following. It looks like 
initTasks() was blocked because of a DFS issue:


{code}
2009-02-19 09:52:53,132 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200902190419_0129_r_000178_0' has completed task_200902190419_0129_r_000178 successfully.
2009-02-19 10:03:29,445 WARN org.apache.hadoop.hdfs.DFSClient: Exception while reading from blk_2044238107768440002_840946 of /mapredsystem/hadoopqa/mapredsystem/job_200902190419_0419/job.split from <host:port>: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<host:port> remote=/<host:port>]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1222)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1075)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1630)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1680)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:154)
        at org.apache.hadoop.mapred.JobClient$RawSplit.readFields(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient.readSplitFile(JobClient.java:1060)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:402)
        at org.apache.hadoop.mapred.EagerTaskInitializationListener$JobInitThread.run(EagerTaskInitializationListener.java:55)
        at java.lang.Thread.run(Thread.java:619)

2009-02-19 10:03:29,446 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2044238107768440002_840946 from any node:  java.io.IOException: No live nodes contain current block
{code}

Also attaching the stack trace of the JT taken when one such situation happened.
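
To make the contention easier to see, here is a minimal sketch (not the actual 
Hadoop source; the class and helper names are made up) of the locking pattern 
involved: both initTasks() and obtainTaskCleanupTask() are synchronized on the 
same JobInProgress instance, so a DFS read that blocks inside initTasks() keeps 
the heartbeat path from acquiring the lock.

{code}
import java.io.IOException;

// Minimal sketch (not the actual Hadoop 0.20 code) of the locking pattern:
// initTasks() and obtainTaskCleanupTask() both synchronize on the same
// JobInProgress instance, so a DFS read that blocks inside initTasks()
// also stalls any heartbeat-path call that needs the same lock.
class JobInProgressSketch {

    // Runs on the job-init thread (EagerTaskInitializationListener$JobInitThread
    // in the trace above). The DFS read happens while the object lock is held.
    public synchronized void initTasks() throws IOException {
        // Stands in for the JobClient.readSplitFile() call in the trace; each
        // read can block for the 60-second socket timeout and is retried
        // against other replicas, so the lock may be held for minutes.
        byte[] rawSplits = readSplitFileFromDfs();
        // ... build the map/reduce tasks from rawSplits ...
    }

    // Called from the JobTracker's heartbeat handling. It cannot proceed until
    // initTasks() releases the lock, which is what the hang looks like.
    public synchronized Object obtainTaskCleanupTask() {
        return null; // details omitted
    }

    // Hypothetical stand-in for the blocking DFSClient read in the stack trace.
    private byte[] readSplitFileFromDfs() throws IOException {
        return new byte[0];
    }
}
{code}

One way to limit the impact would be to read job.split before taking the 
JobInProgress lock, so a slow DFS read only delays that job's initialization 
rather than stalling heartbeat processing for the whole JT.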

> JobTracker hangs for long periods of time
> -----------------------------------------
>
>                 Key: HADOOP-5285
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5285
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: trace.txt
>
>
> On one of the larger clusters of 2000 nodes, the JT hung quite often, 
> sometimes for periods on the order of 10-15 minutes and once for one and a 
> half hours(!). The stack trace shows that JobInProgress.obtainTaskCleanupTask() 
> is waiting for the lock on the JobInProgress object, which 
> JobInProgress.initTasks() holds for a long time while waiting on DFS operations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
