[ 
https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-127:
-----------------------------

    Attachment: hdfs-127-branch20-redone.txt

Here's a redone patch against branch-20. I've taken the approach of resetting 
the failure counter at the start of the user-facing read and pread calls in 
DFSInputStream. The logic here is that the failure counter should limit the 
number of internal retries before throwing an exception back to the client. As 
long as the client is making some progress, we don't care about the total 
number of failures over the course of the stream, and it should be reset for 
each operation.
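
To make the shape of the change concrete, here's a minimal sketch of the idea
with simplified names (this is not the actual branch-20 diff, and
fetchBlockByteRange/readBuffer are just stand-ins for the internal retry
paths):

    import java.io.IOException;

    /*
     * Sketch only: the per-stream failure counter is cleared on entry to each
     * user-facing call, so the configured limit bounds the internal retries of
     * a single operation rather than of the stream's whole lifetime.
     */
    abstract class InputStreamRetrySketch {
        protected int failures;   // bumped by the internal retry loops below

        /** Positional read (pread): starts with a fresh retry budget. */
        public int read(long position, byte[] buf, int off, int len) throws IOException {
            failures = 0;
            return fetchBlockByteRange(position, buf, off, len);
        }

        /** Sequential read: likewise starts from a clean counter. */
        public synchronized int read(byte[] buf, int off, int len) throws IOException {
            failures = 0;
            return readBuffer(buf, off, len);
        }

        // Internal helpers that retry across datanodes, incrementing 'failures'
        // and throwing once it reaches the configured maximum; bodies omitted.
        protected abstract int fetchBlockByteRange(long pos, byte[] buf, int off, int len)
            throws IOException;

        protected abstract int readBuffer(byte[] buf, int off, int len) throws IOException;
    }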

I've also included two new unit tests. The first, in TestCrcCorruption, guards 
against the error we saw with the original patch. It reliably reproduces the 
infinite loop with the broken patch that was originally on branch-20. The 
second new unit test, in TestDFSClientRetries, verifies the new behavior, 
namely that a given DFSInputStream can continue to be used even when the 
_total_ number of failures exceeds maxBlockAcquires, so long as the number of 
retries on any given read() operation does not.
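
Roughly, the property that second test pins down looks like the following
sketch (made-up names and a simulated flaky read, not the actual
TestDFSClientRetries code):

    import static org.junit.Assert.assertEquals;

    import java.io.IOException;
    import org.junit.Test;

    public class PerReadRetryBudgetSketchTest {

        private static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;  // hypothetical limit

        /** Fails the first two attempts of every read, then succeeds. */
        private int flakyRead(int[] attemptsThisRead) throws IOException {
            if (attemptsThisRead[0] < 2) {
                attemptsThisRead[0]++;
                throw new IOException("simulated bad datanode");
            }
            return 42;   // "data"
        }

        @Test
        public void streamStaysUsableAcrossManyFailures() throws IOException {
            int lifetimeFailures = 0;
            for (int readNo = 0; readNo < 5; readNo++) {
                int failures = 0;                 // reset at the start of each read
                int[] attempts = {0};
                while (true) {
                    try {
                        assertEquals(42, flakyRead(attempts));
                        break;                    // this read made progress
                    } catch (IOException e) {
                        failures++;
                        lifetimeFailures++;
                        if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                            throw e;              // would mean the per-read reset regressed
                        }
                    }
                }
            }
            // 5 reads x 2 failures each = 10 > MAX_BLOCK_ACQUIRE_FAILURES, yet
            // no single read exhausted its budget, so nothing reached the caller.
            assertEquals(10, lifetimeFailures);
        }
    }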

To implement the second test, I pulled in the Mockito dependency via Ivy. The 
ability to inject bad block locations into the client made the test a lot more 
straightforward, and I don't see any downsides to pulling it into branch-20.
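
For reference, the injection pattern is roughly the following (LocationSource
is a made-up interface standing in for the real client-side hook; this
illustrates the Mockito idea, not the code in the attached test):

    import static org.mockito.Mockito.anyLong;
    import static org.mockito.Mockito.anyString;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import java.util.concurrent.atomic.AtomicInteger;

    class BadLocationInjectionSketch {

        /** Hypothetical seam through which the client asks where a block lives. */
        interface LocationSource {
            String getBlockLocations(String path, long offset, long length);
        }

        /** Answers the first 'badCalls' lookups with a bogus location, then recovers. */
        static LocationSource failingFirst(int badCalls, String goodLocation) {
            AtomicInteger calls = new AtomicInteger();
            LocationSource source = mock(LocationSource.class);
            when(source.getBlockLocations(anyString(), anyLong(), anyLong()))
                .thenAnswer(inv -> calls.getAndIncrement() < badCalls
                    ? "bogus-datanode:0"   // forces the client down its retry path
                    : goodLocation);       // then let it make progress again
            return source;
        }
    }

Driving the client through a mock like this avoids having to actually corrupt
blocks or kill datanodes just to exercise the retry path.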


> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HDFS-127
>                 URL: https://issues.apache.org/jira/browse/HDFS-127
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Igor Bolotin
>            Assignee: Igor Bolotin
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: 4681.patch, h127_20091016.patch, h127_20091019.patch, 
> h127_20091019b.patch, hdfs-127-branch20-redone.txt, 
> hdfs-127-regression-test.txt
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long 
> time we had been running Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with 
> exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
> ...
> The investigation showed that the root cause was that we had exceeded the 
> number of xcievers on the datanodes; that was fixed by raising the 
> configuration setting to 2k.
> However, one thing that bothered me was that even after the datanodes had 
> recovered from the overload and most of the client servers had been shut 
> down, we still observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another 
> problem: a DFSInputStream instance can become unusable once the number of 
> failures over the lifetime of the instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure 
> counter before reading the next block (a patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really 
> understand the necessity of keeping track of failed block accesses in the 
> DFS client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
