[
https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-347:
-----------------------------
Attachment: local-reads-doc
Attaching v1 of a design document for this feature. This does not include a
test plan - that will follow once implementation has gone a bit further.
Pasting the design doc below as well:
--
h1. Design Document: Local Read Optimization
h2. Problem Definition
Currently, when the DFS Client is located on the same physical node as the
DataNode serving the data, it does not use this knowledge to its advantage. All
blocks are read through the same protocol based on a TCP connection. Early
experimentation has shown that this has a 20-30% overhead when compared with
reading the block files directly off the local disk.
This JIRA seeks to improve the performance of node-local reads by providing a
fast path that is enabled in this case. This case is very common, especially in
the context of MapReduce jobs where tasks are scheduled local to their data.
Although writes are likely to see an improvement here too, this JIRA will focus
only on the read path. The write path is significantly more complicated due to
write pipeline recovery, append support, etc. Additionally, the write path will
still have to go over TCP to the non-local replicas, so the throughput
improvements will probably not be as marked.
h2. Use Cases
# As mentioned above, the majority of data read during a MapReduce job tends to
be from local datanodes. This optimization should improve MapReduce performance
of read-constrained jobs significantly.
# Random reads should see a significant performance benefit with this patch as
well. Applications such as the HBase Region Server should see a very large
improvement.
Users will not have to make any specific changes to use the performance
improvement - the optimization should be transparent and retain all existing
semantics.
h2. Interaction with Current System
Supporting this behavior requires modifications in two areas:
h3. DataNode
The datanode needs to be extended to provide access to the local block storage
to the reading client.
h3. DFSInputStream
DFSInputStream needs to be extended in order to enable the fast read path when
reading from local datanodes.
h2. Requirements
h3. Unix Domain Sockets via JNI
In order to maintain security, we cannot simply have the reader access blocks
through the local filesystem. The reader may be running as an arbitrary user
ID, and we should not require world-readable permissions on the block storage.
Unix domain sockets offer the ability to transport already-open file
descriptors from one peer to another using the "ancillary data" construct and
the sendmsg(2) system call. This ability is documented in unix(7) under the
SCM_RIGHTS section.
Unix domain sockets are unfortunately not available in Java. We will need to
employ JNI to access the appropriate system calls.
h3. Modify DFSClient/DataNode interaction
The DFS Client will need to be able to initiate the fast path read when it
detects it is connecting to a local DataNode. The DataNode needs to respond to
this request by providing the appropriate file descriptors or by reverting to
the normal slow path if the functionality has been administratively disabled,
etc.
h2. Design
h3. Unix Domain Sockets in Java
The Android open source project currently includes support for Unix Domain
Sockets in the android.net package. It also includes the native JNI code to
implement these classes. Android is Apache 2.0 licensed and thus we can freely
use the code in Hadoop.
The Android project relies on a lot of custom build infrastructure and utility
functions. In order to reduce our dependencies, we will copy the appropriate
classes into a new org.apache.hadoop.net.unix package. We will include the
appropriate JNI code in the existing libhadoop library. If HADOOP-4998 (native
runtime library for Hadoop) progresses in the near term, we could include this
functionality there.
The JNI code needs small modifications to work properly in the Hadoop build
system without pulling in a large number of Android dependencies.
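To make the intended usage concrete, here is a rough sketch of how the ported classes might be used to pass file descriptors. The class and method names (LocalSocket, setFileDescriptorsForSend, getAncillaryFileDescriptors) mirror the Android API we would be copying; the final names in org.apache.hadoop.net.unix may differ once the port is cleaned up.
{code:java}
// Hypothetical usage sketch only. The import below assumes the Android
// classes are copied into org.apache.hadoop.net.unix with their original
// names; the eventual API may differ.
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.net.unix.LocalSocket;   // assumed ported class

public class FdPassingSketch {

  // Sender side (DataNode): attach already-open descriptors as ancillary data.
  static void sendBlockFds(LocalSocket sock, FileInputStream blockData,
      FileInputStream blockMeta) throws IOException {
    sock.setFileDescriptorsForSend(new FileDescriptor[] {
        blockData.getFD(), blockMeta.getFD() });
    // Ancillary data only travels alongside regular data, so at least one
    // ordinary byte must be written to push the descriptors across.
    sock.getOutputStream().write(0);
    sock.getOutputStream().flush();
  }

  // Receiver side (DFSClient): read the marker byte, then collect the
  // descriptors delivered via SCM_RIGHTS.
  static FileDescriptor[] receiveBlockFds(LocalSocket sock) throws IOException {
    sock.getInputStream().read();
    return sock.getAncillaryFileDescriptors();
  }
}
{code}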
h3. Fast path initiation
When DFSInputStream is connecting to a node, it can determine whether that node
is local by simply inspecting the IP address. In the event that it is a local
host and the fast path has not been prohibited by the Configuration, the fast
path will be initiated. The fast path is simply a different BlockReader
implementation.
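As a rough illustration, the decision might look like the following; the configuration key and the LocalAddressCache helper are placeholders rather than committed names (the helper is sketched under "Determining local IPs" below).
{code:java}
// Hypothetical sketch; neither the configuration key nor LocalAddressCache
// is a committed name.
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;

public class FastPathDecisionSketch {

  // Flipped to true the first time a fast-path attempt fails, so that we
  // quietly stay on the existing TCP path afterwards.
  private volatile boolean fastPathDisabled = false;

  boolean shouldTryFastPath(Configuration conf, InetSocketAddress dnAddr) {
    // Placeholder key letting administrators turn the optimization off.
    boolean allowLocal = conf.getBoolean("dfs.read.local.fastpath", true);
    return allowLocal
        && !fastPathDisabled
        && LocalAddressCache.isLocalAddress(dnAddr.getAddress());
  }
}
{code}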
h3. Fast path interface
BlockReader will become an interface, with the current implementation being
renamed to RemoteBlockReader. The fast-path for local reads will be a
LocalBlockReader, which is instantiated after it has been determined that the
target datanode is local.
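Roughly, the split would look like the following; the method names shown here are illustrative placeholders - the real interface will simply carry over whatever methods the existing BlockReader implementation already exposes.
{code:java}
// Illustrative shape of the split only.
import java.io.IOException;

public interface BlockReader {
  /** Read up to len bytes of checksum-verified block data into buf. */
  int read(byte[] buf, int off, int len) throws IOException;

  /** Skip n bytes of block data, keeping checksum state consistent. */
  long skip(long n) throws IOException;

  /** Release the underlying TCP socket or file descriptors. */
  void close() throws IOException;
}
// RemoteBlockReader (the renamed current implementation) and LocalBlockReader
// (the new fast path) would both implement this interface.
{code}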
h3. Fast path mechanism
Currently, when the DFSInputStream connects to the DataNode, it sends
OP_READ_BLOCK, including the access token, block id, etc. Instead, when the
fast path is desired, the client will take the following steps:
# Opens a unix socket listening in the in-memory socket namespace. The socket's
name will be the clientName already available in the input stream,
plus a unique ID for this specific input stream (so that parallel local readers
function without collision).
# Sends a new opcode OP_CONNECT_UNIX. This operation takes the same parameters
as OP_READ_BLOCK, but indicates to the datanode that the client is looking for
a local connection.
# The datanode performs the same access token and block validity checks as it
currently does for the OP_READ_BLOCK case. Thus the security model of the
current implementation is retained.
# If the datanode refuses for any reason, it responds over the block
transceiver protocol with the same error mechanism as the current approach. If
the checks pass:
## DN connects to the client via the unix socket.
## DN opens the block data file and block metadata file
## DN extracts the FileDescriptor objects from these InputStreams, and sends
them as ancillary data on the unix domain socket. It then closes its side of
the unix domain socket.
## DN sends an "OK" response via the TCP socket.
## If any error happens during this process, it sends back an error response.
# On the client side, if an error response is received from the OP_CONNECT_UNIX
request, the client will set a flag indicating that it should no longer try
the fast path, and then fall back to the existing BlockReader.
# If an OK response is received, the client constructs a LocalBlockReader (LBR).
## The LBR reads from the unix domain socket to receive the block data and
metadata file descriptors.
## At this point, both the TCP socket and the unix socket can be closed; the
file descriptors remain valid once they have been received despite any closed
sockets.
## The LBR then provides the BlockReader interface by simply calling seek(),
read(), etc, on an input stream constructed from these file descriptors.
## Some refactoring may occur here to try to share checksum verification code
between the LocalBlockReader and RemoteBlockReader.
The reason for the connect-back protocol rather than having the datanode simply
listen on a unix socket is to simplify the integration path. In order to listen
on a socket, the datanode would need an additional thread to spawn off
transceivers. Additionally, it allows for a way to verify that the client is in
fact reading from the datanode on the target host/port without relying on some
conventional socket path.
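The following sketch shows the client side of this handshake, strictly for illustration: the opcode value, the exact header fields written, and the status encoding are assumptions (the unix-socket classes are carried over from the sketch in the *Unix Domain Sockets in Java* section). The real wire format would reuse the existing OP_READ_BLOCK encoding and access token handling.
{code:java}
// Client-side handshake sketch. OP_CONNECT_UNIX's value, the header fields
// and the status encoding are placeholders; the access token is omitted for
// brevity. LocalSocket/LocalServerSocket are the assumed ported classes.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileDescriptor;
import java.io.IOException;
import java.net.Socket;

import org.apache.hadoop.net.unix.LocalServerSocket;  // assumed ported class
import org.apache.hadoop.net.unix.LocalSocket;         // assumed ported class

public class ConnectUnixSketch {
  static final byte OP_CONNECT_UNIX = (byte) 85;       // placeholder value

  static FileDescriptor[] requestLocalFds(Socket dnSocket, String clientName,
      long blockId, long genStamp) throws IOException {
    // Step 1: listen on a unix socket named after the client plus a unique
    // suffix so parallel input streams in the same process do not collide.
    String sockName = clientName + "." + System.nanoTime();
    LocalServerSocket listener = new LocalServerSocket(sockName);
    try {
      // Step 2: send OP_CONNECT_UNIX with the same parameters as
      // OP_READ_BLOCK, plus the name of our listening socket.
      DataOutputStream out = new DataOutputStream(dnSocket.getOutputStream());
      out.writeByte(OP_CONNECT_UNIX);
      out.writeLong(blockId);
      out.writeLong(genStamp);
      out.writeUTF(sockName);
      out.flush();

      // Steps 3-5: the DN validates the request, connects back, passes the
      // block data and metadata descriptors, then replies over TCP. Any
      // error response means the caller falls back to RemoteBlockReader.
      DataInputStream in = new DataInputStream(dnSocket.getInputStream());
      if (in.readShort() != 0 /* placeholder for OP_STATUS_SUCCESS */) {
        return null;
      }
      LocalSocket conn = listener.accept();
      conn.getInputStream().read();            // marker byte carrying the FDs
      FileDescriptor[] fds = conn.getAncillaryFileDescriptors();
      conn.close();
      return fds;                              // [0] = block data, [1] = metadata
    } finally {
      listener.close();
    }
  }
}
{code}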
h3. DFS Read semantics clarification
Before embarking on the above, the DFS Read semantics should be clarified. The
error handling and retry semantics in the current code are quite unclear. For
example, there is significant discussion in HDFS-127 that indicates a lot of
confusion about proper behavior.
Although the work is orthogonal to this patch, it will be quite beneficial to
nail down the semantics of the existing implementation before attempting to add
onto it. I propose this work be done in a separate JIRA concurrently with
discussion on this one, with the two pieces of work to be committed together if
possible. This will keep the discussion here on-point and avoid digression into
discussion of existing problems like HDFS-127.
h2. Failure Analysis
As described above, if any failure or exception occurs during the establishment
of the fast path, the system will simply fall back to the existing slow path.
One issue that is currently unclear is how to handle IOExceptions on the
underlying blocks when the read is being performed by the client. See *Work
Remaining* below.
h2. Security Analysis
Since the block open() call is still being performed by the datanode, there is
no loss of security with this patch. AccessToken checking is performed by the
datanode in the same manner as currently exists. Since the blocks can be opened
read-only, the recipient of the file descriptors cannot perform unwanted
modification.
h2. Platform Support
Passing file descriptors over Unix Domain Sockets is supported on Linux, BSD,
and Solaris, though there may be some differences between these implementations.
The first version of this JIRA should target Linux only, and automatically
disable itself on platforms where it will not function correctly. Since this is
an optimization and not a new feature (the slow path will continue to be
supported), I believe this is OK.
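One possible shape for the automatic disabling, assuming we gate on the existing NativeCodeLoader check plus an os.name test (the exact conditions are open to discussion):
{code:java}
// Sketch of an automatic platform check; the exact conditions are open.
import org.apache.hadoop.util.NativeCodeLoader;

public class FastPathPlatformCheck {
  static boolean isFastPathSupported() {
    // libhadoop carries the JNI socket code, so it must be loadable, and
    // descriptor passing is only exercised on Linux in the first version.
    return NativeCodeLoader.isNativeCodeLoaded()
        && System.getProperty("os.name").toLowerCase().startsWith("linux");
  }
}
{code}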
h2. Work already completed
h3. Initial experimentation
The early work in HDFS-347 indicated that the performance improvements of this
patch will be substantial. The experimentation modified the BlockReader to
"cheat" and simply open the stored blocks with standard file APIs, which had
been chmodded world-readable. This improved the time to read a 1GB file from 8.7 seconds to
5.3 seconds, and improved random IO performance by a factor of more than 30.
h3. Local Sockets and JNI Library
I have already ported the local sockets JNI code from the Android project into
a local branch of the Hadoop code base, and written simple unit tests to verify
its operation. The JNI code compiles as part of libhadoop, and the Java side
uses the existing NativeCodeLoader class. These patches will become part of the
Common project.
h3. DFSInputStream refactoring
To aid in testing and understanding of the code, I have refactored
DFSInputStream to be a standalone class instead of an inner class of DFSClient.
Additionally, I have converted BlockReader to an interface and renamed
BlockReader to RemoteBlockReader. In the process I also refactored the contents
of DFSInputStream to clarify the failure and retry semantics. This work should
be migrated to another JIRA as mentioned above.
h3. Fast path initiation and basic operation
I have implemented the algorithm as described above and added new unit tests to
verify operation. Basic unit tests are currently passing using the fast path
reads.
h2. Work Remaining / Open Questions
h3. Checksums
The current implementation of LocalBlockReader does not verify checksums. Thus,
some unit tests are not passing. Some refactoring will probably need to be done
to share the checksum verification code between LocalBlockReader and
RemoteBlockReader.
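For illustration, the shared helper could be as simple as the following; it assumes the existing .meta file layout of CRC32 checksums over fixed-size chunks (512 bytes by default) and omits parsing of the meta-file header:
{code:java}
// Simplified verification loop a shared helper could expose to both readers.
import java.io.IOException;
import java.util.zip.CRC32;

public class ChunkChecksumSketch {
  /** Verify one chunk of block data against its stored 4-byte CRC32 value. */
  static void verifyChunk(byte[] data, int off, int len, int storedCrc)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    if ((int) crc.getValue() != storedCrc) {
      throw new IOException("Checksum error in chunk at offset " + off);
    }
  }
}
{code}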
h3. IOException handling
Given that the reads now occur directly in the client, we should
investigate whether we need to add any mechanism for the client to report
errors back to the DFS. The client can still report checksum errors in the
existing mechanism, but we may need to add some method by which it can report
IO Errors (e.g. due to a failing volume). I do not know the current state of
volume error tracking in the datanode; some guidance here would be appreciated.
h3. Interaction with other features (e.g. Append)
We should investigate whether (and how) this feature will interact with other
ongoing work, in particular appends. If there is any complication, it should be
straightforward to simply disable the fast path for any blocks currently under
construction. Given that the primary benefit of the fast path is in MapReduce
jobs, and MapReduce jobs rarely read under-construction blocks, this seems
reasonable and avoids a lot of complexity.
h3. Timeouts
Currently, the JNI library has some TODO markings for implementation of
timeouts on various socket operations. These will need to be implemented for
proper operation.
h3. Benchmarking
Given that this is a performance patch, benchmarks of the final implementation
should be done, covering both random and sequential IO.
h3. Statistics/metrics tracking
Currently, the datanode maintains metrics about the number of bytes read and
written. With this change we will no longer have accurate information unless the
client reports back. Alternatively, the datanode can use the "length"
parameter of OP_CONNECT_UNIX and assume that the client will always read the
entirety of the data it has requested. This is not a strictly accurate
assumption, but the approximation may be fine.
h3. Audit Logs/ClientTrace
Once the DN has sent a file descriptor for a block to the client, it is
impossible to audit the byte offsets that are read. It is possible for a client
to request read access to a small byte range of a block, receive the file
descriptors, and then proceed to read the entire block. We should investigate
whether there is a requirement for byte-range granularity on audit logs and
come up with possible solutions (e.g. disabling the fast path for
non-whole-block reads).
h3. File Descriptors held by Zombie Processes
In practice on some clusters, DFSClient processes can stick around as zombie
processes. In the TCP-based DFSClient, these zombie connections are eventually
timed out by the DN server. In this proposed JIRA, the file descriptors would
already have been transferred, and would thus be stuck open in the zombie
process. This will not block file deletion, but it does prevent the underlying
file system from reclaiming the deleted blocks' space. This may cause problems on HDFS instances with a lot of
block churn and a bad zombie problem. Dhruba can possibly elaborate here.
h3. Determining local IPs
In order to determine when to attempt the fast path, the DFSClient needs to
know when it is connecting to a local datanode. This will rarely be a loopback
IP address, so we need some way of determining which IPs are actually local.
This will probably necessitate an additional method or two in NetUtils in order
to inspect the local interface list, with some caching behavior.
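A possible sketch of that helper, using standard interface enumeration with a crude cache (the class name is a placeholder; the real change would likely land in NetUtils):
{code:java}
// Placeholder class; the real helper would likely be a method on NetUtils.
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class LocalAddressCache {
  // The interface list rarely changes over the life of a client, so the
  // result of the first enumeration is cached.
  private static volatile Set<InetAddress> localAddrs;

  static boolean isLocalAddress(InetAddress addr) {
    Set<InetAddress> addrs = localAddrs;
    if (addrs == null) {
      addrs = new HashSet<InetAddress>();
      try {
        for (NetworkInterface nif
            : Collections.list(NetworkInterface.getNetworkInterfaces())) {
          addrs.addAll(Collections.list(nif.getInetAddresses()));
        }
      } catch (SocketException e) {
        // If enumeration fails, fall back to matching loopback only.
      }
      localAddrs = addrs;
    }
    return addr.isLoopbackAddress() || addrs.contains(addr);
  }
}
{code}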
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
> Key: HDFS-347
> URL: https://issues.apache.org/jira/browse/HDFS-347
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: George Porter
> Attachments: HADOOP-4801.1.patch, HADOOP-4801.2.patch,
> HADOOP-4801.3.patch, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to
> move the code to the data. However, putting the DFS client on the same
> physical node as the data blocks it acts on doesn't improve read performance
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem
> is due to the HDFS streaming protocol causing many more read I/O operations
> (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB
> disk block from the DataNode process (running in a separate JVM) running on
> the same machine. The DataNode will satisfy the single disk block request by
> sending data back to the HDFS client in 64-KB chunks. In BlockSender.java,
> this is done in the sendChunk() method, relying on Java's transferTo()
> method. Depending on the host O/S and JVM implementation, transferTo() is
> implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk by issuing a separate I/O
> operation for each chunk. The result is that the single request for a 64-MB
> block ends up hitting the disk as over a thousand smaller requests of 64 KB
> each.
> Since the DFSClient runs in a different JVM and process than the DataNode,
> shuttling data from the disk to the DFSClient also results in context
> switches each time network packets get sent (in this case, each 64-KB chunk
> turns into a large number of 1500-byte packet send operations). Thus we see
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think
> the answer is to provide a mechanism for a DFSClient to directly open data
> blocks that happen to be on the same machine. It could do this by examining the set of
> LocatedBlocks returned by the NameNode, marking those that should be resident
> on the local host. Since the DataNode and DFSClient (probably) share the
> same hadoop configuration, the DFSClient should be able to find the files
> holding the block data, and it could directly open them and send data back to
> the client. This would avoid the context switches imposed by the network
> layer, and would allow for much larger read buffers than 64KB, which should
> reduce the number of iops imposed by each read block operation.