[ 
https://issues.apache.org/jira/browse/HADOOP-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated HADOOP-2758:
---------------------------------

    Attachment: HADOOP-2758.patch


The attached patch removes extra buffer copies when data is read from the 
datanode (by a client or during replication).

 - before : disk --> large BufferedInputStream --> small datanode buffer --> 
large BufferedOutputStream --> socket.
 - after : disk --> large datanode buffer --> socket
 - each arrow represents a memory copy. The cost of the arrows at the ends is 
shared between user and kernel, I think (using a direct buffer might reduce 
that further; I will try it).
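
To make the two read paths above concrete, here is a minimal sketch in plain Java. It is not code from the patch: the class name {{CopyPathSketch}}, the method names, and the buffer sizes are all assumptions chosen for illustration, and in-memory streams stand in for the disk and the socket.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; names and sizes are not taken from the patch.
class CopyPathSketch {
    static final int LARGE = 64 * 1024; // assumed "large" buffer size
    static final int SMALL = 4 * 1024;  // assumed "small" buffer size

    // "before": disk --> BufferedInputStream --> small buffer
    //           --> BufferedOutputStream --> socket
    static long copyBefore(InputStream disk, OutputStream socket) throws IOException {
        InputStream in = new BufferedInputStream(disk, LARGE);
        OutputStream out = new BufferedOutputStream(socket, LARGE);
        byte[] small = new byte[SMALL];
        long total = 0;
        int n;
        while ((n = in.read(small, 0, small.length)) != -1) {
            out.write(small, 0, n); // extra copy into the output stream's buffer
            total += n;
        }
        out.flush();
        return total;
    }

    // "after": disk --> one large buffer --> socket
    static long copyAfter(InputStream disk, OutputStream socket) throws IOException {
        byte[] large = new byte[LARGE];
        long total = 0;
        int n;
        while ((n = disk.read(large, 0, large.length)) != -1) {
            socket.write(large, 0, n); // written straight from the single buffer
            total += n;
        }
        socket.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long before = copyBefore(new ByteArrayInputStream(data), sink);
        sink.reset();
        long after = copyAfter(new ByteArrayInputStream(data), sink);
        System.out.println("before path bytes: " + before + ", after path bytes: " + after);
    }
}
```

Both methods move the same bytes; the second simply skips the two intermediate stream buffers, which is where the saved copies come from.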

I will post more microbenchmarks similar to those in my last comment.

We can also remove one copy in DFSClient. The current {{readChunk()}} 
interface in {{FSInputChecker}} does not allow it. We could add an optional 
{{readChunks()}} so that an implementation can get access to the user's 
complete buffer. There would be a default implementation of this. Should I 
file a jira?
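
One way the proposed {{readChunks()}} with a default implementation could look is sketched below. This is only an assumption about the shape of the idea: {{ChunkSource}} and {{ArrayChunkSource}} are hypothetical stand-ins, not the real {{FSInputChecker}}, and the simplified signatures here (no checksum argument) are illustrative rather than the actual Hadoop API.

```java
import java.io.IOException;

// Hypothetical stand-in for a chunk-oriented reader; not the real FSInputChecker.
abstract class ChunkSource {
    // Reads at most one chunk starting at pos; returns bytes read, or -1 at EOF.
    protected abstract int readChunk(long pos, byte[] buf, int offset, int len)
            throws IOException;

    // Default implementation: fill the caller's buffer by calling readChunk()
    // repeatedly. A subclass could override this to read many chunks directly
    // into the user's complete buffer in one pass.
    protected int readChunks(long pos, byte[] buf, int offset, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = readChunk(pos + total, buf, offset + total, len - total);
            if (n <= 0) {
                break; // EOF or nothing more available
            }
            total += n;
        }
        return (total == 0 && len > 0) ? -1 : total;
    }
}

// Toy in-memory source that hands out data in fixed-size chunks.
class ArrayChunkSource extends ChunkSource {
    private final byte[] data;
    private final int chunkSize;

    ArrayChunkSource(byte[] data, int chunkSize) {
        this.data = data;
        this.chunkSize = chunkSize;
    }

    @Override
    protected int readChunk(long pos, byte[] buf, int offset, int len) {
        if (pos >= data.length) {
            return -1;
        }
        int n = Math.min(Math.min(chunkSize, len), data.length - (int) pos);
        System.arraycopy(data, (int) pos, buf, offset, n);
        return n;
    }
}
```

With the default in place, existing implementations keep working unchanged, and only those that can benefit from seeing the whole user buffer need to override {{readChunks()}}.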

This patch changes the DATA_TRANSFER_PROTOCOL a bit. 

Currently there are no improvements in buffering while writing data to DFS. I 
will do that in a follow-up jira.

All the unit tests pass. I will run them on Windows as well. No new tests are 
added since this does not change any functionality and is purely a 
performance improvement. 



> Reduce memory copies when data is read from DFS
> -----------------------------------------------
>
>                 Key: HADOOP-2758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2758
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>             Fix For: 0.17.0
>
>         Attachments: HADOOP-2758.patch
>
>
> Currently the datanode and client parts of DFS perform multiple copies of 
> data on the 'read path' (i.e. the path from storage on the datanode to the 
> user buffer on the client). This jira reduces these copies by enhancing the 
> data read protocol and the implementation of read on both the datanode and 
> the client. I will describe the changes in the next comment.
> The requirement is that this fix should reduce CPU usage and should not 
> cause a regression in any benchmark. It might not improve the benchmarks 
> since most benchmarks are not CPU bound.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
