[ https://issues.apache.org/jira/browse/HADOOP-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717190#action_12717190 ]
Jonathan Gray commented on HADOOP-4681:
---------------------------------------

I just tried this patch after getting a lot of bad blocks reported under heavy load from HBase. After applying it, I can now get through all of my load tests without a problem. The datanode is heavily loaded and HBase takes a while to perform compactions (~1 min in the worst case), but it manages to get through it, whereas without the patch it crapped out and I wasn't able to recover easily. I'm running an otherwise clean Hadoop 0.20.0.

> DFSClient block read failures cause open DFSInputStream to become unusable
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4681
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4681
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.2, 0.19.0, 0.19.1, 0.20.0
>            Reporter: Igor Bolotin
>             Fix For: 0.19.2
>
>         Attachments: 4681.patch
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long time we were running Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
>
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>         at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
>         at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
>         at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
>         at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
>         at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
>         at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
>         ...
>
> The investigation showed that the root of this issue was that we exceeded the number of xcievers on the datanodes; that was fixed by raising the configuration setting to 2k.
> However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed these errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another problem: a DFSInputStream instance can become unusable once the number of failures over the lifetime of that instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial - just reset the failure counter before reading the next block (patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
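
To make the failure accounting described in the issue concrete, here is a small self-contained Java sketch. This is not the actual DFSClient code: the names (failures, chooseDataNode, blockSeekTo) and the default threshold of 3 (dfs.client.max.block.acquire.failures) follow my reading of the 0.19/0.20 sources and should be treated as assumptions.

import java.io.IOException;

/**
 * Toy model of the per-stream failure counter discussed in HADOOP-4681.
 * NOT the real DFSClient code; names and the threshold are assumptions.
 */
public class FailureCounterSketch {
    private final int maxBlockAcquireFailures = 3; // dfs.client.max.block.acquire.failures (assumed default)
    private int failures = 0;                      // lives as long as the open input stream

    /** Models chooseDataNode(): give up once the lifetime counter is exhausted. */
    void chooseDataNode() throws IOException {
        if (failures >= maxBlockAcquireFailures) {
            // Without the patch this throws for every subsequent block, even
            // after the datanodes have recovered, so the open stream is
            // permanently unusable.
            throw new IOException("Could not obtain block");
        }
    }

    /** Models a failed read attempt against a chosen datanode. */
    void readAttemptFailed() {
        failures++;
    }

    /** Models the proposed fix: clear the counter before reading the next block. */
    void blockSeekTo() {
        failures = 0; // old, transient failures no longer poison future reads
    }
}

With the reset in blockSeekTo(), a burst of xciever-related failures only affects the block being read at that moment; reads of later blocks start with a clean counter, which matches the recovery behavior reported in the comment above.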