[ 
https://issues.apache.org/jira/browse/HADOOP-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713255#action_12713255
 ] 

Raghu Angadi commented on HADOOP-4681:
--------------------------------------

This bug exists and I don't think the current patch is the right one (yet). We 
probably don't need this variable at all (see below).

Looking at how the 'failures' variable is used, its use is pretty limited. My 
guess is that it was there right from the beginning and a lot of DFSClient has 
changed around it since then.

I would say we need a description of what it means, i.e. when it should be 
incremented and why there should be a limit. That will also answer when it 
should be reset. 

As I see it now, it is incremented only when a connect to a datanode fails. 
That implies it should be reset when such a connect succeeds (in 
chooseDataNode()). 
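
A minimal sketch of that reset-on-success rule, to make the intended meaning 
concrete. This is not the actual DFSClient code: the class 
ConnectFailureCounter, its methods, and the maxFailures limit are all 
hypothetical names chosen for illustration.

```java
// Hypothetical illustration only; not taken from DFSClient.
final class ConnectFailureCounter {
    private final int maxFailures; // assumed limit on consecutive failed connects
    private int failures = 0;

    ConnectFailureCounter(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Records a failed connect; returns true once the limit is reached. */
    boolean onConnectFailure() {
        failures++;
        return failures >= maxFailures;
    }

    /** A successful connect clears the count, so old failures do not accumulate. */
    void onConnectSuccess() {
        failures = 0;
    }

    int failures() {
        return failures;
    }
}
```

With this rule the counter measures consecutive failed connects rather than 
failures over the lifetime of the stream.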

But this is still not enough, since it does not allow DFSClient to try all the 
replicas available (what if the number of replicas is larger than 3?). Maybe we 
should try each datanode once (or twice). That implies we probably don't need 
this variable at all; a local counter in 'chooseDataNode()' would do.
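
Roughly, the local-counter idea could look like the sketch below. It is only an 
illustration under assumptions, not a patch: ReplicaChooser, the tryConnect 
callback, and attemptsPerNode are hypothetical, and the real chooseDataNode() 
has a different signature and does more work.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration only; not the real chooseDataNode().
final class ReplicaChooser {
    /**
     * Tries every replica up to attemptsPerNode times and returns the first
     * one that connects, or an empty Optional if every attempt fails.
     * All counting is local to this call, so nothing carries over to the
     * next read on the same stream.
     */
    static Optional<String> chooseDataNode(List<String> replicas,
                                           Predicate<String> tryConnect,
                                           int attemptsPerNode) {
        for (int round = 0; round < attemptsPerNode; round++) {
            for (String node : replicas) {
                if (tryConnect.test(node)) {
                    return Optional.of(node);
                }
            }
        }
        return Optional.empty();
    }
}
```

Because the attempt budget scales with the number of replicas, a block with 
more than 3 replicas still gets every datanode tried before giving up.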


> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-4681
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4681
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.2, 0.19.0, 0.19.1, 0.20.0
>            Reporter: Igor Bolotin
>             Fix For: 0.19.2
>
>         Attachments: 4681.patch
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long 
> time we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with 
> exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) 
> ...
> The investigation showed that the root of this issue was that we exceeded the 
> number of xcievers in the datanodes; that was fixed by changing the 
> configuration settings to 2k.
> However, one thing that bothered me was that even after the datanodes had 
> recovered from the overload and most of the client servers had been shut 
> down, we still observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another 
> problem: the DFSInputStream instance might become unusable once the number of 
> failures over the lifetime of this instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure 
> counter before reading the next block (patch will be attached shortly).
> This also seems to be related to HADOOP-3185, but I'm not sure I really 
> understand the necessity of keeping track of failed block accesses in the DFS 
> client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
