[jira] [Commented] (HDFS-6607) DFSInputStream Seek performance improvement

Steve Loughran (JIRA) Mon, 30 Jun 2014 04:18:07 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047564#comment-14047564
 ]


Steve Loughran commented on HDFS-6607:
--------------------------------------

Some of the object store streams (e.g. for Swift) do this too; a seek is 
very expensive there.

What might be useful is moving this to BufferedInputStream, which already has 
some buffered operations; it could be enhanced to also skip forward some bytes 
on a read. Or factor out the skip logic in some other way so that we stop 
having to replicate it everywhere.
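
As a rough illustration of what "factoring out the skip logic" could look 
like, here is a minimal sketch. The class ForwardSkip, the method skipTo and 
the maxSkip threshold are all hypothetical names made up for this example; 
none of them exist in Hadoop, and this is not how DFSInputStream or the Swift 
stream is actually written. It only assumes a plain java.io.InputStream whose 
current position the caller tracks.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not a Hadoop class): shared "seek forward by skipping" logic. */
final class ForwardSkip {
    private ForwardSkip() {}

    /**
     * Tries to reach targetPos by discarding bytes from the already-open stream.
     *
     * @param maxSkip assumed tuning knob: largest gap worth reading through
     * @return targetPos if the skip succeeded, or -1 if the seek is backward or
     *         the gap exceeds maxSkip (the caller should then do a real seek);
     *         the stream is left untouched in the -1 case
     */
    static long skipTo(InputStream in, long currentPos, long targetPos, long maxSkip)
            throws IOException {
        long remaining = targetPos - currentPos;
        if (remaining < 0 || remaining > maxSkip) {
            return -1;
        }
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() may return 0 without being at EOF, so read one byte
                // to distinguish "no progress this call" from end of stream.
                if (in.read() < 0) {
                    throw new EOFException("EOF before reaching position " + targetPos);
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        return targetPos;
    }
}
```

A stream's seek() could call skipTo() first and fall back to its existing 
close-and-reopen path only when it returns -1. What a sensible maxSkip would 
be is an open question and presumably differs between HDFS and object stores.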

> DFSInputStream Seek performance improvement
> -------------------------------------------
>
>                 Key: HDFS-6607
>                 URL: https://issues.apache.org/jira/browse/HDFS-6607
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, performance
>    Affects Versions: 2.4.1
>            Reporter: Abdullah Alamoudi
>
> When having a DFSInputStream open and seeking to a position that resides in 
> the same block, if the target position is in the TCP buffer already, the seek 
> is performed efficiently simply by eating up the intervening data. See line 
> 1368 in the file: 
> hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but after the TCP buffer, the 
> inputstream performs a set of actions including closing the current block 
> reader, locating the block again, selecting a data node and creating a new 
> block reader. During this, many objects are created and all of this is very 
> inefficient for users with random access needs (e.g. index access).
> I have conducted some experiments which showed that reading 3,000,000 records 
> using seeks and reads is slower than reading 60,000,000 records using seeks 
> and reads as well which shows the need to improve the seek implementation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
