[
https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-127:
-----------------------------
Attachment: hdfs-127-branch20-redone.txt
Here's a redone patch against branch-20. I've taken the approach of resetting
the failure counter at the start of the user-facing read and pread calls in
DFSInputStream. The logic here is that the failure counter should limit the
number of internal retries before throwing an exception back to the client. As
long as the client is making some progress, we don't care about the total
number of failures over the course of the stream, so the counter should be reset for
each operation.
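To make the intended semantics concrete, here is a minimal, self-contained model of the counter behavior described above (a sketch only: the names maxBlockAcquires and failures echo this discussion rather than the actual DFSClient fields, and the retry path is reduced to a simple loop):

import java.io.IOException;

// Toy model of a per-operation retry limit: the counter is reset on entry
// to each user-facing call, so only failures within one call count toward
// the limit, while the total over the stream's lifetime is unbounded.
class RetryCounterModel {
    private final int maxBlockAcquires;   // hypothetical per-operation retry cap
    private int failures;                 // internal failure counter

    RetryCounterModel(int maxBlockAcquires) {
        this.maxBlockAcquires = maxBlockAcquires;
    }

    // Models one user-facing read()/pread(): reset, then tolerate up to
    // maxBlockAcquires transient failures before giving up.
    void read(int transientFailures) throws IOException {
        failures = 0;                                  // reset per operation
        while (transientFailures-- > 0) {
            if (++failures > maxBlockAcquires) {
                throw new IOException("Could not obtain block");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        RetryCounterModel in = new RetryCounterModel(3);
        // Ten calls with two transient failures each: 20 failures in total,
        // yet every call stays under the per-call limit and succeeds.
        for (int i = 0; i < 10; i++) {
            in.read(2);
        }
        System.out.println("stream still usable after many total failures");
    }
}

Without the reset, the same sequence of calls would throw once the cumulative count passed the limit, which is exactly the "stream becomes unusable" symptom in the report below.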
I've also included two new unit tests. The first, in TestCrcCorruption, guards
against the error we saw with the original patch. It reliably reproduces the
infinite loop with the broken patch that was originally on branch-20. The
second new unit test, in TestDFSClientRetries, verifies the new behavior,
namely that a given DFSInputStream can continue to be used even when the
_total_ number of failures exceeds maxBlockAcquires, so long as the number of
retries on any given read() operation does not.
To accomplish the second test, I pulled in the mockito dependency via ivy. The
ability to inject bad block locations into the client made the test a lot more
straightforward, and I don't see any downsides to pulling it into branch-20.
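For reference, the injection pattern looks roughly like the following. This is a generic, self-contained sketch of the Mockito technique (serve a bad answer a bounded number of times per operation, then a good one), using a made-up BlockLocator interface in place of whatever client-side lookup the real test stubs:

import static org.mockito.Mockito.*;

import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

public class RetryInjectionSketch {

    // Hypothetical stand-in for the block-location call being intercepted.
    interface BlockLocator {
        String locate(String path, long offset, long length);
    }

    public static void main(String[] args) {
        BlockLocator locator = mock(BlockLocator.class);

        // Answer with a bad location on every other call and a good one
        // otherwise, so each operation sees one failure while the total
        // number of failures keeps growing across operations.
        when(locator.locate(anyString(), anyLong(), anyLong()))
            .thenAnswer(new Answer<String>() {
                private int calls = 0;
                public String answer(InvocationOnMock invocation) {
                    return (++calls % 2 == 1) ? "bogus-host:0" : "good-host:50010";
                }
            });

        // Ten "operations": each retries past a single bad answer, so all
        // succeed even though ten bad locations were served in total.
        for (int op = 0; op < 10; op++) {
            String loc = locator.locate("/some/file", 0L, 1024L);
            if (loc.startsWith("bogus")) {
                loc = locator.locate("/some/file", 0L, 1024L);   // one retry
            }
            System.out.println("operation " + op + " resolved to " + loc);
        }
    }
}

The actual test presumably installs a similar "fail, then succeed" Answer on the client's block-location lookup so the injected bad locations flow through the normal DFSInputStream retry path.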
> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
> Key: HDFS-127
> URL: https://issues.apache.org/jira/browse/HDFS-127
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client
> Reporter: Igor Bolotin
> Assignee: Igor Bolotin
> Fix For: 0.21.0, 0.22.0
>
> Attachments: 4681.patch, h127_20091016.patch, h127_20091019.patch,
> h127_20091019b.patch, hdfs-127-branch20-redone.txt,
> hdfs-127-regression-test.txt
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long time
> we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with
> exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read:
> java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
> ...
> The investigation showed that the root of this issue was that we had exceeded the
> number of xcievers on the datanodes, and that was fixed by raising the
> configuration setting to 2k.
> However, one thing that bothered me was that even after the datanodes recovered
> from the overload and most of the client servers had been shut down, we still
> observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another
> problem: a DFSInputStream instance might become unusable once the number of
> failures over the lifetime of that instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure
> counter before reading the next block (a patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really
> understand the necessity of keeping track of failed block accesses in the DFS
> client.