[ https://issues.apache.org/jira/browse/HADOOP-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585795#action_12585795 ]

Chris Douglas commented on HADOOP-1911:
---------------------------------------

There is a very simple "fix" for this: make the "failures" count an instance 
variable on DFSInputStream rather than a local variable in chooseDataNode. 
This would change the semantics of MAX_BLOCK_ACQUIRE_FAILURES to a cap on the 
total number of block acquisition failures over the life of the stream, which 
is not exactly correct, but it is a fix we could easily get into 0.17. It will 
yield false negatives for a particularly problematic stream, but for 
applications like distcp it should be sufficient.
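To make the stopgap concrete, here is a minimal, hypothetical sketch (not the actual DFSClient code; class and method names are illustrative) of moving the failure count from a local in chooseDataNode to an instance variable, so MAX_BLOCK_ACQUIRE_FAILURES becomes a lifetime cap for the stream:

```java
import java.io.IOException;
import java.util.List;

// Illustrative sketch only: in the real DFSClient, chooseDataNode kept
// "failures" as a local variable, so it reset to zero on every call and
// a stream with no reachable replicas could retry forever.
class SketchInputStream {
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    // Instance variable: persists across chooseDataNode calls, capping
    // total acquisition failures for the life of the stream.
    private int failures = 0;

    /** Returns the index of a live node, or throws once the cap is hit. */
    int chooseDataNode(List<Boolean> nodeIsLive) throws IOException {
        while (true) {
            for (int i = 0; i < nodeIsLive.size(); i++) {
                if (nodeIsLive.get(i)) {
                    return i;
                }
            }
            failures++;  // previously a local, reset on each invocation
            if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                throw new IOException(
                    "Could not obtain block: retries exhausted for stream");
            }
        }
    }

    int getFailures() { return failures; }
}
```

As the comment notes, this is deliberately conservative: a stream that hits transient failures on several different blocks can be failed early (a false negative), but it guarantees termination.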

After consulting with Dhruba, the longer-term fix will track failures not with 
a single list of "deadnodes", but with a map of blocks to lists of deadnodes 
and, to preserve the retry semantics, a map of blocks to full acquisition 
failure counts. Right now, a datanode that fails to serve a block is 
blacklisted on the stream until there are no replicas available for some 
block, at which point the list is cleared. The false negatives this yields 
require the existing, problematic retry semantics. After confirming this 
approach with Koji, I'll file a JIRA for the more correct fix and submit the 
sufficient one for 0.17.
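A minimal sketch of that per-block design, assuming illustrative names throughout (this is not the actual Hadoop implementation): dead datanodes and acquisition failures are keyed by block, so blacklisting one block's replicas no longer forces a stream-wide reset.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: failures are tracked per block, not per stream.
class PerBlockFailureTracker {
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    // block id -> datanodes that failed to serve that particular block
    private final Map<Long, Set<String>> deadNodes = new HashMap<>();
    // block id -> count of full acquisition failures (all replicas tried)
    private final Map<Long, Integer> acquireFailures = new HashMap<>();

    void markDead(long blockId, String node) {
        deadNodes.computeIfAbsent(blockId, k -> new HashSet<>()).add(node);
    }

    boolean isDead(long blockId, String node) {
        return deadNodes.getOrDefault(blockId, Collections.<String>emptySet())
                        .contains(node);
    }

    /**
     * Record that every replica of this block failed, and clear only this
     * block's blacklist so the next pass may retry all of its nodes.
     * Returns false once the block has exhausted its retries.
     */
    boolean recordFullFailureAndMaybeRetry(long blockId) {
        int n = acquireFailures.merge(blockId, 1, Integer::sum);
        deadNodes.remove(blockId);  // preserves the per-block retry semantics
        return n < MAX_BLOCK_ACQUIRE_FAILURES;
    }
}
```

The key property is isolation: exhausting replicas of block A clears only block A's blacklist and charges only block A's failure count, so healthy blocks on the same stream are unaffected.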

> infinite loop in dfs -cat command.
> ----------------------------------
>
>                 Key: HADOOP-1911
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1911
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.13.1, 0.14.3
>            Reporter: Koji Noguchi
>            Assignee: Chris Douglas
>            Priority: Blocker
>             Fix For: 0.17.0
>
>
> [knoguchi]$ hadoop dfs -cat fileA
> 07/09/13 17:36:02 INFO fs.DFSClient: Could not obtain block 0 from any node: 
> java.io.IOException: No live nodes contain current block
> 07/09/13 17:36:20 INFO fs.DFSClient: Could not obtain block 0 from any node: 
> java.io.IOException: No live nodes contain current block
> [repeats forever]
> Setting one of the Debug statements to Warn, it kept showing 
> {noformat} 
>  WARN org.apache.hadoop.fs.DFSClient: Failed to connect to /99.99.999.9:11111: java.io.IOException: Recorded block size is 7496, but datanode reports size of 0
>       at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:690)
>       at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:771)
>       at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>       at java.io.DataInputStream.readFully(DataInputStream.java:178)
>       at java.io.DataInputStream.readFully(DataInputStream.java:152)
>       at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:123)
>       at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:340)
>       at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:259)
>       at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.map(CopyFiles.java:466)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
> {noformat} 
> Turns out fileA was corrupted. Fsck showed a crc file of 7496 bytes, but when I 
> searched for the blocks on each node, all 3 replicas were size 0.
> Not sure how it got corrupted, but it would be nice if the dfs command failed 
> instead of getting into an infinite loop.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.