[ 
https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048489#comment-14048489
 ] 

Liang Xie commented on HDFS-6607:
---------------------------------

bq. reading 3,000,000 records using seeks and reads is slower than reading 
60,000,000 records using seeks and reads as well which shows the need to 
improve the seek implementation
Sorry, could you describe this in more detail? I don't quite follow.

As I understand it, the heavy work is not directly related to seek(), 
because the seek(long) method itself performs no costly operation. The 
performance issue you mention is more likely related to the read() 
implementation: inside it, we sometimes decide to create a new block reader, 
which probably requires a NameNode request, a DataNode socket connection, etc.
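
To illustrate the point above, here is a minimal sketch of a "lazy seek" design, in which seek() only records the target and the expensive reader setup is deferred to read(). This is a simplified model and not the actual DFSInputStream code; the class LazySeekSketch and all of its members are hypothetical names used only for illustration.

```java
// Simplified model, NOT the real DFSInputStream. All names are hypothetical.
public class LazySeekSketch {
    private long pos = 0;           // logical position requested by the caller
    private long readerPos = -1;    // position of the open block reader, -1 if none
    private int readersOpened = 0;  // counts the expensive reader setups

    // seek() only records the target; nothing costly happens here.
    public void seek(long target) {
        if (target < 0) {
            throw new IllegalArgumentException("negative seek target");
        }
        pos = target;
    }

    // read() pays the cost: if the open reader is not positioned at pos,
    // a new reader is set up (in HDFS this may involve the NameNode and a
    // DataNode connection; here it is just a counter).
    public int read() {
        if (readerPos != pos) {
            openReaderAt(pos);
        }
        pos++;
        readerPos++;
        return 0; // dummy byte; the data itself is irrelevant to this sketch
    }

    private void openReaderAt(long target) {
        readersOpened++;
        readerPos = target;
    }

    public int getReadersOpened() {
        return readersOpened;
    }
}
```

In this model, several consecutive seek() calls cost nothing, and only the next read() triggers a reader setup, so a benchmark that times "seeks and reads" together would attribute the cost to the wrong call.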

> Improve DFSInputStream forward seek performance
> -----------------------------------------------
>
>                 Key: HDFS-6607
>                 URL: https://issues.apache.org/jira/browse/HDFS-6607
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, performance
>    Affects Versions: 2.4.1
>            Reporter: Abdullah Alamoudi
>
> When a DFSInputStream is open and seeks to a position that resides in 
> the same block, if the target position is already in the TCP buffer, the seek 
> is performed efficiently, simply by consuming the intervening data. See line 
> 1368 in the file: 
> hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but beyond the TCP buffer, the 
> input stream performs a series of actions, including closing the current block 
> reader, locating the block again, selecting a DataNode, and creating a new 
> block reader. Many objects are created along the way, and all of this is very 
> inefficient for users with random access needs (e.g. index access).
> I have conducted some experiments which showed that reading 3,000,000 records 
> using seeks and reads is slower than reading 60,000,000 records using seeks 
> and reads as well which shows the need to improve the seek implementation.
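
The two forward-seek paths described in the issue can be modeled roughly as follows. This is only a sketch under stated assumptions: an in-memory byte array stands in for a block, a ByteArrayInputStream stands in for the block reader, and the `window` parameter stands in for the amount of data assumed to be already buffered. The class ForwardSeekSketch and its members are hypothetical and do not appear in Hadoop.

```java
import java.io.ByteArrayInputStream;

// Simplified model of the behavior described in the issue, NOT Hadoop code.
public class ForwardSeekSketch {
    private final byte[] block;           // stand-in for one HDFS block
    private final int window;             // assumed size of already-buffered data
    private ByteArrayInputStream reader;  // stand-in for the block reader
    private long pos = 0;                 // current position within the block
    private int reopens = 0;              // counts the expensive path

    public ForwardSeekSketch(byte[] block, int window) {
        this.block = block;
        this.window = window;
        this.reader = new ByteArrayInputStream(block);
    }

    public void seek(long target) {
        if (target < 0 || target > block.length) {
            throw new IllegalArgumentException("seek target out of range");
        }
        long diff = target - pos;
        if (diff >= 0 && diff <= window) {
            // Cheap path: consume the intervening bytes from the open reader.
            long left = diff;
            while (left > 0) {
                long skipped = reader.skip(left);
                if (skipped <= 0) {
                    throw new IllegalStateException("could not skip forward");
                }
                left -= skipped;
            }
        } else {
            // Expensive path: drop the reader and create a new one at target.
            reader = new ByteArrayInputStream(
                    block, (int) target, block.length - (int) target);
            reopens++;
        }
        pos = target;
    }

    public int read() {
        int b = reader.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    public int getReopens() {
        return reopens;
    }
}
```

With a 16-byte window, a seek from position 0 to 10 takes the cheap path, while a later seek from 11 to 60 takes the expensive one even though both targets lie in the same block. That second case is the one the issue asks to improve.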



