[
https://issues.apache.org/jira/browse/MAPREDUCE-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Collins moved HADOOP-5361 to MAPREDUCE-2813:
------------------------------------------------
Affects Version/s: (was: 0.21.0)
0.21.0
Key: MAPREDUCE-2813 (was: HADOOP-5361)
Project: Hadoop Map/Reduce (was: Hadoop Common)
> Tasks freeze with "No live nodes contain current block", job takes long time
> to recover
> ---------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Matei Zaharia
>
> Running a recent version of trunk on 100 nodes, I occasionally see some tasks
> freeze at startup and hang the job. These tasks are not speculatively
> executed either. Here's sample output from one of them:
> {noformat}
> 2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
> 2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
> 2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
> 2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
> 2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:155)
> 2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:155)
> {noformat}
> Note how the DFS client fails repeatedly to retrieve the block, with a two-minute wait between attempts, without ever giving up. During this time, the task is *not* speculatively executed. However, once this task attempt finally failed, a new attempt ran successfully, and fetching the input file in question with bin/hadoop fs -get also worked fine.
> There is no mention of the task attempt in question in the NameNode logs, but my guess is that something related to RPC queues is causing its connection to get lost, and the DFSClient does not recover.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira