[ https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-127:
----------------------------------------

    Component/s: hdfs client

> ... Any reason for making the bestNode static? You've changed what bestNode 
> returns. It now returns null rather than throw an IOE with "no live nodes 
> contain current block". Us over in hbase are not going to miss that message.

It is because bestNode(..) is a simple utility method.  It does not use any 
DFSClient fields/methods.
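
As a rough illustration, the static, null-returning variant could look 
something like the sketch below (illustrative only, not the exact code in the 
patch):
{code}
// hypothetical shape of a static, null-returning bestNode(..)
// (illustrative only; see h127_20091016.patch for the actual change)
private static DatanodeInfo bestNode(DatanodeInfo[] nodes,
    AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes) {
  if (nodes != null) {
    for (DatanodeInfo node : nodes) {
      if (!deadNodes.containsKey(node)) {
        return node;  // first live node for the current block
      }
    }
  }
  return null;  // no live node contains the current block
}
{code}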

In the original code, bestNode(..) is invoked only in chooseDataNode(..), 
which basically uses a try-catch to deal with the IOException thrown by 
bestNode(..).
{code}
// original code in chooseDataNode(..)
        try {
          DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
          ...
        } catch (IOException ie) {
          ...
          LOG.info("Could not obtain block " + block.getBlock()
              + " from any node: " + ie
              + ". Will get new block locations from namenode and retry...");
          ...
        }
{code}
As shown above, the IOException ie is used only for a log message.  I think "No 
live nodes contain current block" is somewhat redundant in that message.  Do you 
think that "Could not obtain block " + block.getBlock() + " from any node" is 
clear enough?  If not, we can change the log message.

> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HDFS-127
>                 URL: https://issues.apache.org/jira/browse/HDFS-127
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Igor Bolotin
>         Attachments: 4681.patch, h127_20091016.patch
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long 
> time we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with 
> exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
> ...
> The investigation showed that the root of this issue is that we exceeded the 
> number of xcievers on the data nodes; that was fixed by raising the 
> configuration setting to 2k.
> However, one thing that bothered me was that even after the datanodes 
> recovered from the overload and most of the client servers had been shut 
> down, we still observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another 
> problem: the DFSInputStream instance might become unusable once the number 
> of failures over the lifetime of the instance exceeds the configured 
> threshold.
> The fix for this specific issue seems to be trivial - just reset the failure 
> counter before reading the next block (patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really 
> understand the necessity of keeping track of failed block accesses in the 
> DFS client.
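
The "reset the failure counter" fix described above could look roughly like the 
sketch below (illustrative only; see 4681.patch / h127_20091016.patch for the 
actual changes):
{code}
// hypothetical sketch (illustrative only, not the attached patch):
// at the start of DFSInputStream.blockSeekTo(..), clear the per-stream
// failure count so that old, already-recovered failures cannot permanently
// disable the stream.
        failures = 0;  // reset before reading the next block
        // ... existing logic: locate the block, choose a datanode, connect ...
{code}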

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
