[ 
https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049069#comment-14049069
 ] 

Abdullah Alamoudi commented on HDFS-6607:
-----------------------------------------

This problem does not occur with the local block reader unless the seek goes 
backward, but it definitely occurs with the remote block reader.

> Improve DFSInputStream forward seek performance
> -----------------------------------------------
>
>                 Key: HDFS-6607
>                 URL: https://issues.apache.org/jira/browse/HDFS-6607
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, performance
>    Affects Versions: 2.4.1
>            Reporter: Abdullah Alamoudi
>
> When a DFSInputStream is open and a seek targets a position in the same 
> block, the seek is performed efficiently if the target position is already in 
> the TCP buffer: the stream simply consumes the intervening data. See line 
> 1368 in the file: 
> hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but beyond the TCP buffer, the 
> input stream performs a series of actions: closing the current block 
> reader, locating the block again, selecting a datanode, and creating a new 
> block reader. Many objects are created along the way, and all of this is very 
> inefficient for users with random access needs (e.g., index access).
> I ran some experiments showing that reading 3,000,000 records using seeks 
> and reads is slower than reading 60,000,000 records, also using seeks and 
> reads, which shows the need to improve the seek implementation.
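
To make the two seek paths described above concrete, here is a minimal 
sketch of the decision as a simplified model. It is not the actual 
DFSInputStream code: the class SeekSketch, the enum Plan, the method 
planSeek, and all parameter names are hypothetical, and the buffered-byte 
count is assumed to be known to the caller.

```java
// Simplified, hypothetical model of the seek decision described in the issue.
// It does not use or reproduce any Hadoop API.
final class SeekSketch {
    enum Plan { SKIP_BUFFERED, REOPEN_READER }

    // pos:      current stream position
    // target:   requested seek position
    // blockEnd: last byte offset of the current block (inclusive)
    // buffered: bytes assumed to be already available in the TCP buffer
    static Plan planSeek(long pos, long target, long blockEnd, long buffered) {
        boolean forwardInBlock = pos <= target && target <= blockEnd;
        if (forwardInBlock && target - pos <= buffered) {
            // Cheap path: consume the intervening bytes on the open reader.
            return Plan.SKIP_BUFFERED;
        }
        // Expensive path: close the reader, locate the block again,
        // select a datanode, and create a new block reader.
        return Plan.REOPEN_READER;
    }

    public static void main(String[] args) {
        long blockEnd = 128L * 1024 * 1024 - 1; // assumed 128 MB block
        long buffered = 64 * 1024;              // assumed 64 KB buffered
        System.out.println(planSeek(0, 4096, blockEnd, buffered));      // SKIP_BUFFERED
        System.out.println(planSeek(0, 1_000_000, blockEnd, buffered)); // REOPEN_READER
        System.out.println(planSeek(8192, 4096, blockEnd, buffered));   // REOPEN_READER
    }
}
```

In this model, the second call is the case the issue targets: a forward seek 
inside the same block that still takes the expensive path because the target 
lies beyond the buffered data.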



--
This message was sent by Atlassian JIRA
(v6.2#6252)
