[ 
https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613486#action_12613486
 ] 

Doug Cutting commented on HADOOP-3672:
--------------------------------------

Applications that perform random access don't always know how much they'll 
read.  For example, Lucene uses read(), not pread(), to retrieve posting lists. 
Lucene could perhaps be modified to provide lengths whenever it reads data, 
but ideally random access performance would be good for both read() and 
pread().  Most filesystems optimize both cases, and consequently most 
applications are written assuming that a random read() will be reasonably 
efficient.
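
To make the distinction concrete, here is a minimal sketch using only the JDK 
rather than the HDFS client.  The class name, helper names, and the ';' 
terminator are made up for illustration; RandomAccessFile.seek()/read() stands 
in for a stateful read() whose length is not known up front, and 
FileChannel.read(ByteBuffer, long) stands in for pread(), where the caller 
states offset and length in advance.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ReadVsPread {
    /** read() style: seek, then consume bytes until a terminator; length is unknown up front. */
    public static String readUntil(Path file, long offset, char terminator) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = raf.read()) != -1 && b != terminator) {
                sb.append((char) b);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** pread() style: the caller supplies both offset and length; no file position is kept. */
    public static String positionalRead(Path file, long offset, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);   // does not move the channel's position
                if (n == -1) break;          // hit end of file early
                pos += n;
            }
            return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical sample data: a ';'-terminated record starting at offset 10. */
    public static Path sampleFile() {
        try {
            Path p = Files.createTempFile("pread-demo", ".txt");
            p.toFile().deleteOnExit();
            Files.write(p, "0123456789abcde;fghij".getBytes(StandardCharsets.US_ASCII));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = sampleFile();
        System.out.println(readUntil(p, 10, ';'));      // caller never states a length
        System.out.println(positionalRead(p, 10, 5));   // caller must know the length is 5
    }
}
```

The first helper is the pattern an application like the one described above 
would need to stay fast; the second only helps when the length is already 
known.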

> are you proposing RPCs for all datanode transfers?

We need to understand whether there are hard reasons why we cannot use RPC for 
all network communications.  Right now, HDFS uses both RPC and raw TCP, and 
mapred uses RPC and HTTP.  Security, authentication and authorization would all 
be simpler if we used fewer communication mechanisms, plus we'd have a unified 
connection cache, etc.  But we obviously don't want to go that way if it will 
kill performance.


> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55, Dual Core Opteron 280 2.4 GHz, 4 GB 
> memory
>            Reporter: George Wu
>         Attachments: pread_test.java
>
>
> pread() calls establish a new connection per request. YourKit Java profiles 
> show that this connection overhead is significant on the DataNode. 
> I wrote a simple microbenchmark program that performs many iterations of 
> pread() from different offsets of a large file. I hacked the 
> DFSClient/DataNode code to reuse the same connection and DataNode request 
> handler thread. The performance improvement was 7% when the data is served 
> from disk and 80% when the data is served from the OS page cache.
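
The kind of comparison described in the issue can be sketched with a toy 
loopback server.  This is not the attached pread_test.java and does not use 
DFSClient or DataNode; the class name, the 12-byte request header (offset, 
length), and the synthetic reply bytes are all assumptions made for 
illustration.  It only contrasts opening a socket per read with reusing one 
socket and one handler thread.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;

class PersistentConnectionSketch {
    /** Starts a toy server on an ephemeral loopback port; one daemon handler thread per connection. */
    public static ServerSocket startServer() {
        try {
            ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
            Thread acceptor = new Thread(() -> {
                try {
                    while (true) {
                        Socket client = server.accept();
                        Thread handler = new Thread(() -> serve(client));
                        handler.setDaemon(true);
                        handler.start();
                    }
                } catch (IOException closed) {
                    // server socket was closed: stop accepting
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Answers (offset, length) requests on one connection until the client hangs up. */
    private static void serve(Socket client) {
        try (Socket s = client) {
            DataInputStream in = new DataInputStream(s.getInputStream());
            OutputStream out = s.getOutputStream();
            while (true) {
                long offset = in.readLong();   // EOFException here ends the loop
                int length = in.readInt();
                byte[] data = new byte[length];
                for (int i = 0; i < length; i++) data[i] = (byte) (offset + i);
                out.write(data);
                out.flush();
            }
        } catch (IOException clientGone) {
            // client closed the connection
        }
    }

    /** Sends one request as a single write and reads the full reply; returns bytes received. */
    private static int request(Socket s, long offset, int length) throws IOException {
        byte[] header = ByteBuffer.allocate(12).putLong(offset).putInt(length).array();
        s.getOutputStream().write(header);
        byte[] reply = new byte[length];
        new DataInputStream(s.getInputStream()).readFully(reply);
        return reply.length;
    }

    /** Opens and closes a connection for every read. */
    public static long newConnectionPerRead(int port, int reads, int length) {
        long total = 0;
        try {
            for (int i = 0; i < reads; i++) {
                try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
                    total += request(s, (long) i * length, length);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Reuses one connection (and one server-side handler thread) for all reads. */
    public static long persistentConnection(int port, int reads, int length) {
        long total = 0;
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
            for (int i = 0; i < reads; i++) {
                total += request(s, (long) i * length, length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = startServer()) {
            int port = server.getLocalPort();
            long t0 = System.nanoTime();
            newConnectionPerRead(port, 200, 4096);
            long t1 = System.nanoTime();
            persistentConnection(port, 200, 4096);
            long t2 = System.nanoTime();
            System.out.println("per-read connections: " + (t1 - t0) / 1_000_000 + " ms");
            System.out.println("persistent connection: " + (t2 - t1) / 1_000_000 + " ms");
        }
    }
}
```

Timings from a loopback toy like this say nothing about real DataNode 
behavior; the sketch only shows where the per-request connection setup and 
handler-thread startup costs come from.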

