[
https://issues.apache.org/jira/browse/HDFS-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892476#comment-13892476
]
Todd Lipcon commented on HDFS-5182:
-----------------------------------
I think you guys are talking past each other here.
Haohui -- I think you're proposing something that HDFS in fact implemented a
while back, which may be why Colin is confused. In HDFS-347 we added support
for the DN to pass a file descriptor to the client, which then uses ordinary
read() syscalls to access the data with the usual semantics. This is already
in released versions and is heavily used by high-performance applications
(e.g. HBase or Impala). We generally refer to this feature as "short-circuit
read".
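For anyone less familiar with fd-passing, here is a minimal Python sketch of
the general mechanism (SCM_RIGHTS over a Unix domain socket). It is not the
HDFS code: the helper names send_fd, recv_fd and demo_short_circuit_read are
made up for illustration, a temp file stands in for a block file, and both
"sides" run in one process for brevity.

```python
import array
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one open file descriptor over a Unix domain socket (SCM_RIGHTS)."""
    fds = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor that was sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial item in the ancillary payload.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo_short_circuit_read():
    # "DataNode" side: owns the block file (a temp file is an assumption here).
    with tempfile.NamedTemporaryFile() as block_file:
        block_file.write(b"block data")
        block_file.flush()
        dn_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            send_fd(dn_sock, block_file.fileno())
            # "Client" side: reads through the received descriptor directly,
            # without the data passing through the socket.
            fd = recv_fd(client_sock)
            try:
                return os.pread(fd, 10, 0)
            finally:
                os.close(fd)
        finally:
            dn_sock.close()
            client_sock.close()
```

The point is that only the descriptor crosses the socket; the data itself is
read straight from the file by the receiving side.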
Then, in the current work on caching, we noticed that, if data is in-memory and
has already been touched once (minor page-faulted), the performance you can get
from mmap-based access is significantly better than what you get by calling
read, since you avoid a memory copy. This is the new "zero-copy read" API
introduced in HDFS-4953. In particular see [this
comment|https://issues.apache.org/jira/browse/HDFS-4953?focusedCommentId=13707586&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13707586]
in which I described a benchmark where zero-copy beats short-circuit by about
3x.
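To make the difference between the two access paths concrete, here is a small
Python sketch under simple assumptions: a temp file stands in for the block
file, and read_both_ways is a hypothetical helper, not an HDFS API. It shows
where the copy happens, not the speedup itself.

```python
import mmap
import os
import tempfile


def read_both_ways(payload):
    """Return (bytes obtained via read, whether the mmap view matches them)."""
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        fd = f.fileno()
        # Short-circuit style: the kernel copies the bytes into a new buffer.
        copied = os.pread(fd, len(payload), 0)
        # Zero-copy style: map the file and inspect the pages in place.
        with mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ) as m:
            view = memoryview(m)
            try:
                same = view == copied  # compares contents without building a new bytes object
            finally:
                view.release()  # the mapping cannot be closed while a view is exported
        return copied, same
```

With read, every call pays for a copy out of the page cache; with the mapping,
the caller touches the cached pages directly.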
Of course, zero-copy read is not always safe if the data is not mlocked, for
the reasons you identified: in particular, there is the chance of SIGBUS on an
IO error. So, this JIRA seeks to make the client automatically switch back and
forth between zero-copy read and short-circuit read (i.e. the read syscall)
based on whether the datanode side has mlocked the memory. The options that
Colin proposed in the earlier comment concern the _control plane_ --
specifically, how the DN communicates to the client whether an area of the
file is safe to zero-copy or not. The option that he settled on for the
control plane is a shared-memory segment, which happens to be implemented
using an mmapped tmpfs file (a very common mechanism for shared-memory
inter-process communication on Linux).
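As a rough illustration of that kind of control plane, here is a Python sketch
in which two mappings of the same file act as the shared segment. Everything
specific in it is an assumption for illustration only: the one-byte-per-slot
layout, the SLOT_* values, and the helper names are hypothetical and do not
describe the actual segment format, and a regular temp file stands in for a
tmpfs file.

```python
import mmap
import os
import tempfile

SLOT_MLOCKED = 1   # hypothetical flag value: the DN has mlocked this replica
SLOT_UNLOCKED = 0  # hypothetical flag value: zero-copy is not safe


def open_segment(path, num_slots):
    """Map the shared file read/write; the mapping is MAP_SHARED by default on Unix."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, num_slots)
    finally:
        os.close(fd)  # mmap keeps its own duplicate of the descriptor


def choose_read_path(segment, slot):
    """Client-side decision: mmap access only while the DN marks the slot as locked."""
    return "zero-copy" if segment[slot] == SLOT_MLOCKED else "short-circuit"


def demo_control_plane():
    with tempfile.NamedTemporaryFile() as backing:
        backing.write(bytes(4))  # four zeroed slots
        backing.flush()
        dn_view = open_segment(backing.name, 4)
        client_view = open_segment(backing.name, 4)
        try:
            before = choose_read_path(client_view, 2)
            dn_view[2] = SLOT_MLOCKED      # DN announces the replica is mlocked
            after = choose_read_path(client_view, 2)
            dn_view[2] = SLOT_UNLOCKED     # DN revokes it
            revoked = choose_read_path(client_view, 2)
            return before, after, revoked
        finally:
            dn_view.close()
            client_view.close()
```

A real design also has to deal with the race between checking the flag and
finishing the access, which this sketch ignores.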
None of this is meant to remove the original short-circuit implementation. In
fact, the fd-passing is used both for short-circuit reads (read() on the
passed fd) _and_ for zero-copy reads (access to an mmap() of the passed fd).
Hope that helps resolve the confusion above.
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid
> ---------------------------------------------------------------------------------
>
> Key: HDFS-5182
> URL: https://issues.apache.org/jira/browse/HDFS-5182
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Affects Versions: 3.0.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
>
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid. This implies adding a new field to the response to
> REQUEST_SHORT_CIRCUIT_FDS. We also need some kind of heartbeat from the
> client to the DN, so that the DN can inform the client when the mapped region
> is no longer locked into memory.