[ https://issues.apache.org/jira/browse/HADOOP-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713401#action_12713401 ]
Raghu Angadi commented on HADOOP-4681:
--------------------------------------

> Wouldn't this revert- and reintroduce- HADOOP-1911?

I see. I just looked at HADOOP-1911 and I don't think it fixed the real problem.

The loop is caused by the combination of the reset of dead nodes in chooseDataNode() and the 'while (s != null)' loop in blockSeekTo(). Note that the actual failure occurs while trying to create BlockReader(), not in chooseDataNode().

The problem is that chooseDataNode() cannot decide whether the datanode is ok or not; just an address for the DN is not enough. We also need connect(), a 'success reply', and at least a few bytes read from the datanode. So HADOOP-1911 fixed the infinite loop, but not for the right reasons.

We could define a successful datanode connection as 'being able to read the non-zero bytes that we need'. A failure count keeps growing until there is a 'successful connection', and should be reset after that (a somewhat similar approach to HADOOP-3831).

I think this time we should have an explicitly stated policy of when a hard failure occurs (and maybe when we refetch the block data, etc.).

> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-4681
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4681
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.2, 0.19.0, 0.19.1, 0.20.0
>            Reporter: Igor Bolotin
>             Fix For: 0.19.2
>
>         Attachments: 4681.patch
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long time
> we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with
> exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read:
> java.io.IOException: Could not obtain block: blk_5604690829708125511_15489
> file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>     at java.io.DataInputStream.read(DataInputStream.java:132)
>     at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>     at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
>     at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
>     at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
>     at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
>     at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
>     at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
>     ...
> The investigation showed that the root of this issue is that we exceeded the
> number of xcievers in the data nodes, and that was fixed by changing the
> configuration settings to 2k.
> However, one thing that bothered me was that even after the datanodes recovered
> from overload and most of the client servers had been shut down, we still
> observed errors in the logs of running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another
> problem: the DFSInputStream instance might become unusable once the number of
> failures over the lifetime of this instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure
> counter before reading the next block (patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really
> understand the necessity of keeping track of failed block accesses in the DFS
> client.
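
The failure-count policy proposed in the comment (count failures until a read returns non-zero bytes, then reset) might look roughly like the sketch below. It is only an illustration under stated assumptions: `ReadFailureTracker`, its methods, and the threshold parameter are hypothetical names, not part of DFSClient or of the attached patch.

```java
// Hypothetical sketch of the proposed policy; not code from DFSClient or 4681.patch.
final class ReadFailureTracker {
    private final int maxFailures; // assumed configurable hard-failure threshold
    private int failures = 0;      // failures since the last successful read

    ReadFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /**
     * Record one failed attempt (connect error, bad reply, or failed read).
     * Returns true once the count reaches the threshold, i.e. a hard failure.
     */
    boolean recordFailure() {
        failures++;
        return failures >= maxFailures;
    }

    /**
     * Record the result of a read. Only a read that returned more than zero
     * bytes counts as a "successful connection" and resets the count.
     */
    void recordRead(int bytesRead) {
        if (bytesRead > 0) {
            failures = 0;
        }
    }

    int failureCount() {
        return failures;
    }
}
```

With this shape, the count tracks failures since the last useful read rather than over the lifetime of the stream, so a stream that survived a temporary datanode overload would not stay unusable afterwards.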