[ 
https://issues.apache.org/jira/browse/HDFS-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892476#comment-13892476
 ] 

Todd Lipcon commented on HDFS-5182:
-----------------------------------

I think you guys are talking past each other here.

Haohui -- I think you're proposing something that HDFS in fact implemented a 
while back, which may be why Colin is confused. In HDFS-347 we added support 
for the DN to pass the file descriptor to the client, and the client uses 
normal read() syscalls to access the data with the normal semantics. This is 
already in released versions and is heavily used by high-performance 
applications (e.g., HBase or Impala). We generally refer to this feature as 
"short-circuit read".
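
As a rough illustration of the fd-passing idea (not HDFS code), the sketch 
below ships an open descriptor from one end of a socket pair to the other and 
then reads through the received descriptor with a plain read(). The comment 
above does not say how the descriptor travels; the use of a Unix domain socket 
with SCM_RIGHTS is an assumption here, and the function name and the temp file 
standing in for a block file are hypothetical. It needs Python 3.9+ on a 
Unix-like system.

```python
import os
import socket
import tempfile


def pass_fd_and_read(payload: bytes) -> bytes:
    """Send an open fd over a Unix socket pair, then read() via the received fd."""
    # Both ends live in one process here; in practice they would be the
    # datanode and the client (assumption for the sake of a runnable demo).
    dn_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.NamedTemporaryFile() as block_file:
            block_file.write(payload)
            block_file.flush()

            # "Datanode" side: open the block file and ship the descriptor.
            dn_fd = os.open(block_file.name, os.O_RDONLY)
            try:
                socket.send_fds(dn_sock, [b"fd"], [dn_fd])
            finally:
                # The descriptor in flight keeps the file open independently.
                os.close(dn_fd)

            # "Client" side: receive the descriptor and read with a normal syscall.
            _msg, fds, _flags, _addr = socket.recv_fds(client_sock, 16, 1)
            client_fd = fds[0]
            try:
                return os.read(client_fd, len(payload))
            finally:
                os.close(client_fd)
    finally:
        dn_sock.close()
        client_sock.close()
```

Once the client holds the descriptor, every later read is an ordinary local 
syscall with no datanode involvement on the data path.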

Then, in the current work on caching, we noticed that, if data is in-memory and 
has already been touched once (minor page-faulted), the performance you can get 
from mmap-based access is significantly better than what you get by calling 
read, since you avoid a memory copy. This is the new "zero-copy read" API 
introduced in HDFS-4953. In particular see [this 
comment|https://issues.apache.org/jira/browse/HDFS-4953?focusedCommentId=13707586&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13707586]
 in which I described a benchmark where zero-copy beats short-circuit by about 
3x.
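
To make the difference between the two access paths concrete, here is a 
minimal sketch (again not HDFS code) that reads the same file once through a 
read-style syscall, which copies the bytes into a new buffer, and once through 
a read-only mmap, which exposes the page-cache pages directly. The function 
names and the temp file are hypothetical, and the sketch makes no attempt to 
reproduce the benchmark numbers mentioned above.

```python
import mmap
import os
import tempfile


def read_with_syscall(fd: int, offset: int, length: int) -> bytes:
    # pread() copies the requested bytes into a freshly allocated buffer.
    return os.pread(fd, length, offset)


def compare_paths(payload: bytes) -> bool:
    """Return True if the mmap view and the read() copy hold the same bytes."""
    with tempfile.NamedTemporaryFile() as block_file:
        block_file.write(payload)
        block_file.flush()
        fd = os.open(block_file.name, os.O_RDONLY)
        try:
            copied = read_with_syscall(fd, 0, len(payload))
            # Length 0 maps the whole file; ACCESS_READ gives a read-only view.
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
                # The memoryview references the mapping without copying it.
                with memoryview(mapped) as view:
                    return view == copied
        finally:
            os.close(fd)
```

The read path pays for one copy per call; the mapped path pays a page fault on 
first touch and nothing afterwards, which is where the saving comes from once 
the data is resident.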

Of course, zero-copy read is not always safe if the data is not mlocked, for 
the reasons you identified - in particular there is the chance of SIGBUS on an 
IO error. So, this JIRA seeks to make the client automatically switch back and 
forth between zero-copy read and short-circuit read (i.e., the read syscall) 
based on whether the datanode side has mlocked the memory. The options that 
Colin proposed in the earlier comment concern the _control plane_ -- 
specifically, how the DN communicates to the client whether an area of the 
file is safe to zero-copy or not. The option that he settled on for the 
control plane is a shared-memory segment, which happens to be implemented 
using an mmapped tmpfs file (a very common mechanism for shared-memory 
inter-process communication on Linux).
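
The switching logic can be sketched roughly as follows. This is a toy model 
under stated assumptions, not the design in the patch: the one-byte slot 
layout, the names SLOT_SIZE, read_block and demo_switch, and the use of an 
ordinary temp file in place of a tmpfs file are all hypothetical. Two shared 
mappings of the same file play the roles of the DN's writable view and the 
client's read-only view of the segment.

```python
import mmap
import os
import tempfile

SLOT_SIZE = 1  # hypothetical layout: one byte per block, 1 means "mlocked"


def read_block(block_fd: int, length: int, shm: mmap.mmap):
    """Pick the access path from the flag the DN published in shared memory."""
    if shm[0] == 1:
        # Flag set: treat the region as safe to map.
        with mmap.mmap(block_fd, length, access=mmap.ACCESS_READ) as mapped:
            # Copied out only so the demo can return after unmapping; a real
            # reader would hand the mapping itself to the caller.
            return "zero-copy", bytes(mapped[:length])
    # Flag clear: fall back to a read syscall on the same descriptor.
    return "short-circuit", os.pread(block_fd, length, 0)


def demo_switch():
    payload = b"cached block"
    with tempfile.NamedTemporaryFile() as shm_file, \
            tempfile.NamedTemporaryFile() as block_file:
        shm_file.write(b"\x00" * SLOT_SIZE)
        shm_file.flush()
        block_file.write(payload)
        block_file.flush()

        # Two shared mappings of the same file: writable for the "DN",
        # read-only for the "client".
        dn_view = mmap.mmap(shm_file.fileno(), SLOT_SIZE)
        client_view = mmap.mmap(shm_file.fileno(), SLOT_SIZE,
                                access=mmap.ACCESS_READ)
        block_fd = os.open(block_file.name, os.O_RDONLY)
        try:
            modes = []
            dn_view[0] = 1  # DN announces the block is locked in memory
            modes.append(read_block(block_fd, len(payload), client_view)[0])
            dn_view[0] = 0  # DN announces the block is no longer locked
            modes.append(read_block(block_fd, len(payload), client_view)[0])
            return modes
        finally:
            os.close(block_fd)
            client_view.close()
            dn_view.close()
```

Note that both branches use the same passed descriptor, and that the toy 
ignores the race between checking the flag and touching the mapping, which a 
real implementation would have to handle.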

Nothing in this feature is meant to remove the original short-circuit 
implementation, and in fact the fd-passing is used both for short-circuit reads 
(read() on the passed fd) _and_ for the zero-copy reads (access to an mmap() of 
the passed fd).

Hope that helps resolve the confusion above.

> BlockReaderLocal must allow zero-copy reads only when the DN believes it's 
> valid
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-5182
>                 URL: https://issues.apache.org/jira/browse/HDFS-5182
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's 
> valid.  This implies adding a new field to the response to 
> REQUEST_SHORT_CIRCUIT_FDS.  We also need some kind of heartbeat from the 
> client to the DN, so that the DN can inform the client when the mapped region 
> is no longer locked into memory.


