[
https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047564#comment-14047564
]
Steve Loughran commented on HDFS-6607:
--------------------------------------
Some of the object store streams (e.g. for Swift) do this too, because a seek
is very expensive there.
What might be useful is moving this to BufferedInputStream, which already has
some buffered operations; it could be enhanced to also skip forward some bytes
on a read. Or factor out the skip logic in some other way so that we stop
having to replicate it everywhere.
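To make the "factor out the skip logic" idea concrete, here is a minimal sketch of a shared helper. It is not existing Hadoop code: the class name ForwardSkipSeek, the method seekBySkipping, and the maxSkip threshold are all hypothetical, and it assumes the caller tracks its own stream position. A forward seek within the threshold is served by discarding bytes from the already-open stream; anything else is reported back so the caller can fall back to its normal (expensive) seek.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical shared helper; names and threshold are illustrative only. */
final class ForwardSkipSeek {

    private ForwardSkipSeek() {
    }

    /**
     * Tries to move from currentPos to targetPos by discarding bytes from
     * the open stream instead of reopening it.
     *
     * @return targetPos on success, or -1 if the caller must do a real seek
     *         (backward seek, distance above maxSkip, or end of stream hit).
     *         On -1 the stream may have been partially consumed, so the
     *         caller should not keep using its old position.
     */
    static long seekBySkipping(InputStream in, long currentPos,
                               long targetPos, long maxSkip)
            throws IOException {
        long remaining = targetPos - currentPos;
        if (remaining < 0 || remaining > maxSkip) {
            return -1; // not a short forward seek
        }
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() may return 0 without being at EOF, so probe
                // with a single-byte read to tell the two cases apart.
                if (in.read() < 0) {
                    return -1; // end of stream before reaching the target
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        return targetPos;
    }
}
```

A stream's seek(targetPos) could call this first and only tear down and reopen its connection when it returns -1. The right value for maxSkip would depend on the store, since it trades the cost of reading unwanted bytes against the cost of a reconnect.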
> DFSInputStream Seek performance improvement
> -------------------------------------------
>
> Key: HDFS-6607
> URL: https://issues.apache.org/jira/browse/HDFS-6607
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, performance
> Affects Versions: 2.4.1
> Reporter: Abdullah Alamoudi
>
> When a DFSInputStream is open and the caller seeks to a position that resides
> in the same block, the seek is performed efficiently if the target position is
> already in the TCP buffer: the stream simply eats up the intervening data. See
> line 1368 in the file:
> hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but beyond the TCP buffer, the
> input stream performs a series of actions, including closing the current block
> reader, locating the block again, selecting a datanode, and creating a new
> block reader. Many objects are created along the way, and all of this is very
> inefficient for users with random access needs (e.g. index access).
> I have conducted some experiments which showed that reading 3,000,000 records
> using seeks and reads is slower than reading 60,000,000 records using seeks
> and reads as well, which shows the need to improve the seek implementation.