[
https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-347:
-----------------------------
Attachment: local-reads-doc
Attaching v1 of a design document for this feature. This does not include a
test plan - that will follow once implementation has gone a bit further.
Pasting the design doc below as well:
--
h1. Design Document: Local Read Optimization
h2. Problem Definition
Currently, when the DFS Client is located on the same physical node as the
DataNode serving the data, it does not use this knowledge to its advantage. All
blocks are read through the same protocol based on a TCP connection. Early
experimentation has shown that this has a 20-30% overhead when compared with
reading the block files directly off the local disk.
This JIRA seeks to improve the performance of node-local reads by providing a
fast path that is enabled in this case. This case is very common, especially in
the context of MapReduce jobs where tasks are scheduled local to their data.
Although writes are likely to see an improvement here too, this JIRA will focus
only on the read path. The write path is significantly more complicated due to
write pipeline recovery, append support, etc. Additionally, the write path will
still have to go over TCP to the non-local replicas, so the throughput
improvements will probably not be as marked.
h2. Use Cases
# As mentioned above, the majority of data read during a MapReduce job tends to
be from local datanodes. This optimization should improve MapReduce performance
of read-constrained jobs significantly.
# Random reads should see a significant performance benefit with this patch as
well. Applications such as the HBase Region Server should see a very large
improvement.
Users will not have to make any specific changes to use the performance
improvement - the optimization should be transparent and retain all existing
semantics.
h2. Interaction with Current System
Supporting this behavior requires modifications in two areas:
h3. DataNode
The datanode needs to be extended to provide access to the local block storage
to the reading client.
h3. DFSInputStream
DFSInputStream needs to be extended in order to enable the fast read path when
reading from local datanodes.
h2. Requirements
h3. Unix Domain Sockets via JNI
In order to maintain security, we cannot simply have the reader access blocks
through the local filesystem. The reader may be running as an arbitrary user
ID, and we should not require world-readable permissions on the block storage.
Unix domain sockets offer the ability to transport already-open file
descriptors from one peer to another using the "ancillary data" construct and
the sendmsg(2) system call. This ability is documented in unix(7) under the
SCM_RIGHTS section.
Unix domain sockets are unfortunately not available in Java. We will need to
employ JNI to access the appropriate system calls.
h3. Modify DFSClient/DataNode interaction
The DFS Client will need to be able to initiate the fast path read when it
detects it is connecting to a local DataNode. The DataNode needs to respond to
this request by providing the appropriate file descriptors or by reverting to
the normal slow path if the functionality has been administratively disabled,
etc.
h2. Design
h3. Unix Domain Sockets in Java
The Android open source project currently includes support for Unix Domain
Sockets in the android.net package. It also includes the native JNI code to
implement these classes. Android is Apache 2.0 licensed and thus we can freely
use the code in Hadoop.
The Android project relies on a lot of custom build infrastructure and utility
functions. In order to reduce our dependencies, we will copy the appropriate
classes into a new org.apache.hadoop.net.unix package. We will include the
appropriate JNI code in the existing libhadoop library. If HADOOP-4998 (native
runtime library for Hadoop) progresses in the near term, we could include this
functionality there.
The JNI code needs small modifications to work properly in the Hadoop build
system without pulling in a large number of Android dependencies.
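To make the intended usage concrete, here is a rough sketch of how the ported classes might be used to pass file descriptors. The class and method names (LocalSocket, setFileDescriptorsForSend, getAncillaryFileDescriptors) mirror the Android API we would be copying; the final names in org.apache.hadoop.net.unix may differ once the port is cleaned up.
{code:java}
// Hypothetical usage sketch only. The import below assumes the Android
// classes are copied into org.apache.hadoop.net.unix with their original
// names; the eventual API may differ.
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.net.unix.LocalSocket;   // assumed ported class

public class FdPassingSketch {

  // Sender side (DataNode): attach already-open descriptors as ancillary data.
  static void sendBlockFds(LocalSocket sock, FileInputStream blockData,
      FileInputStream blockMeta) throws IOException {
    sock.setFileDescriptorsForSend(new FileDescriptor[] {
        blockData.getFD(), blockMeta.getFD() });
    // Ancillary data only travels alongside regular data, so at least one
    // ordinary byte must be written to push the descriptors across.
    sock.getOutputStream().write(0);
    sock.getOutputStream().flush();
  }

  // Receiver side (DFSClient): read the marker byte, then collect the
  // descriptors delivered via SCM_RIGHTS.
  static FileDescriptor[] receiveBlockFds(LocalSocket sock) throws IOException {
    sock.getInputStream().read();
    return sock.getAncillaryFileDescriptors();
  }
}
{code}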
h3. Fast path initiation
When DFSInputStream is connecting to a node, it can determine whether that node
is local by simply inspecting the IP address. In the event that it is a local
host and the fast path has not been prohibited by the Configuration, the fast
path will be initiated. The fast path is simply a different BlockReader
implementation.
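As a rough illustration, the decision might look like the following; the configuration key and the LocalAddressCache helper are placeholders rather than committed names (the helper is sketched under "Determining local IPs" below).
{code:java}
// Hypothetical sketch; neither the configuration key nor LocalAddressCache
// is a committed name.
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;

public class FastPathDecisionSketch {

  // Flipped to true the first time a fast-path attempt fails, so that we
  // quietly stay on the existing TCP path afterwards.
  private volatile boolean fastPathDisabled = false;

  boolean shouldTryFastPath(Configuration conf, InetSocketAddress dnAddr) {
    // Placeholder key letting administrators turn the optimization off.
    boolean allowLocal = conf.getBoolean("dfs.read.local.fastpath", true);
    return allowLocal
        && !fastPathDisabled
        && LocalAddressCache.isLocalAddress(dnAddr.getAddress());
  }
}
{code}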
h3. Fast path interface
BlockReader will become an interface, with the current implementation being
renamed to RemoteBlockReader. The fast-path for local reads will be a
LocalBlockReader, which is instantiated after it has been determined that the
target datanode is local.
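Roughly, the split would look like the following; the method names shown here are illustrative placeholders - the real interface will simply carry over whatever methods the existing BlockReader implementation already exposes.
{code:java}
// Illustrative shape of the split only.
import java.io.IOException;

public interface BlockReader {
  /** Read up to len bytes of checksum-verified block data into buf. */
  int read(byte[] buf, int off, int len) throws IOException;

  /** Skip n bytes of block data, keeping checksum state consistent. */
  long skip(long n) throws IOException;

  /** Release the underlying TCP socket or file descriptors. */
  void close() throws IOException;
}
// RemoteBlockReader (the renamed current implementation) and LocalBlockReader
// (the new fast path) would both implement this interface.
{code}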
h3. Fast path mechanism
Currently, when the DFSInputStream connects to the DataNode, it sends
OP_READ_BLOCK, including the access token, block id, etc. Instead, when the
fast path is desired, the client will take the following steps:
# Opens a unix socket listening in the in-memory socket namespace. The socket's
name will be the clientName already available in the input stream,
plus a unique ID for this specific input stream (so that parallel local readers
function without collision).
# Sends a new opcode OP_CONNECT_UNIX. This operation takes the same parameters
as OP_READ_BLOCK, but indicates to the datanode that the client is looking for
a local connection.
# The datanode performs the same access token and block validity checks as it
currently does for the OP_READ_BLOCK case. Thus the security model of the
current implementation is retained.
# If the datanode refuses for any reason, it responds over the block
transceiver protocol with the same error mechanism as the current approach. If
the checks pass:
## DN connects to the client via the unix socket.
## DN opens the block data file and block metadata file
## DN extracts the FileDescriptor objects from these InputStreams, and sends
them as ancillary data on the unix domain socket. It then closes its side of
the unix domain socket.
## DN sends an "OK" response via the TCP socket.
## If any error happens during this process, it sends back an error response.
# On the client side, if an error response is received from the OP_CONNECT_UNIX
request, the client will set a flag indicating that it should no longer try
the fast path, and then fall back to the existing BlockReader.
# If an OK response is received, the client constructs a LocalBlockReader (LBR).
## The LBR reads from the unix domain socket to receive the block data and
metadata file descriptors.
## At this point, both the TCP socket and the unix socket can be closed; the
file descriptors remain valid once they have been received despite any closed
sockets.
## The LBR then provides the BlockReader interface by simply calling seek(),
read(), etc, on an input stream constructed from these file descriptors.
## Some refactoring may occur here to try to share checksum verification code
between the LocalBlockReader and RemoteBlockReader.
The reason for the connect-back protocol rather than having the datanode simply
listen on a unix socket is to simplify the integration path. In order to listen
on a socket, the datanode would need an additional thread to spawn off
transceivers. Additionally, it allows for a way to verify that the client is in
fact reading from the datanode on the target host/port without relying on some
conventional socket path.
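The following sketch shows the client side of this handshake, strictly for illustration: the opcode value, the exact header fields written, and the status encoding are assumptions (the unix-socket classes are carried over from the sketch in the *Unix Domain Sockets in Java* section). The real wire format would reuse the existing OP_READ_BLOCK encoding and access token handling.
{code:java}
// Client-side handshake sketch. OP_CONNECT_UNIX's value, the header fields
// and the status encoding are placeholders; the access token is omitted for
// brevity. LocalSocket/LocalServerSocket are the assumed ported classes.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileDescriptor;
import java.io.IOException;
import java.net.Socket;

import org.apache.hadoop.net.unix.LocalServerSocket;  // assumed ported class
import org.apache.hadoop.net.unix.LocalSocket;         // assumed ported class

public class ConnectUnixSketch {
  static final byte OP_CONNECT_UNIX = (byte) 85;       // placeholder value

  static FileDescriptor[] requestLocalFds(Socket dnSocket, String clientName,
      long blockId, long genStamp) throws IOException {
    // Step 1: listen on a unix socket named after the client plus a unique
    // suffix so parallel input streams in the same process do not collide.
    String sockName = clientName + "." + System.nanoTime();
    LocalServerSocket listener = new LocalServerSocket(sockName);
    try {
      // Step 2: send OP_CONNECT_UNIX with the same parameters as
      // OP_READ_BLOCK, plus the name of our listening socket.
      DataOutputStream out = new DataOutputStream(dnSocket.getOutputStream());
      out.writeByte(OP_CONNECT_UNIX);
      out.writeLong(blockId);
      out.writeLong(genStamp);
      out.writeUTF(sockName);
      out.flush();

      // Steps 3-5: the DN validates the request, connects back, passes the
      // block data and metadata descriptors, then replies over TCP. Any
      // error response means the caller falls back to RemoteBlockReader.
      DataInputStream in = new DataInputStream(dnSocket.getInputStream());
      if (in.readShort() != 0 /* placeholder for OP_STATUS_SUCCESS */) {
        return null;
      }
      LocalSocket conn = listener.accept();
      conn.getInputStream().read();            // marker byte carrying the FDs
      FileDescriptor[] fds = conn.getAncillaryFileDescriptors();
      conn.close();
      return fds;                              // [0] = block data, [1] = metadata
    } finally {
      listener.close();
    }
  }
}
{code}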
h3. DFS Read semantics clarification
Before embarking on the above, the DFS Read semantics should be clarified. The
error handling and retry semantics in the current code are quite unclear. For
example, there is significant discussion in HDFS-127 that indicates a lot of
confusion about proper behavior.
Although the work is orthogonal to this patch, it will be quite beneficial to
nail down the semantics of the existing implementation before attempting to add
onto it. I propose this work be done in a separate JIRA concurrently with
discussion on this one, with the two pieces of work to be committed together if
possible. This will keep the discussion here on-point and avoid digression into
discussion of existing problems like HDFS-127.
h2. Failure Analysis
As described above, if any failure or exception occurs during the establishment
of the fast path, the system will simply fall back to the existing slow path.
One issue that is currently unclear is how to handle IOExceptions on the
underlying blocks when the read is being performed by the client. See *Work
Remaining* below.
h2. Security Analysis
Since the block open() call is still being performed by the datanode, there is
no loss of security with this patch. AccessToken checking is performed by the
datanode in the same manner as currently exists. Since the blocks can be opened
read-only, the recipient of the file descriptors cannot perform unwanted
modification.
h2. Platform Support
Passing file descriptors over Unix Domain Sockets is supported on Linux, BSD,
and Solaris, though there may be some differences between these implementations.
The first version of this JIRA should target Linux only, and automatically
disable itself on platforms where it will not function correctly. Since this is
an optimization and not a new feature (the slow path will continue to be
supported), I believe this is OK.
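One possible shape for the automatic disabling, assuming we gate on the existing NativeCodeLoader check plus an os.name test (the exact conditions are open to discussion):
{code:java}
// Sketch of an automatic platform check; the exact conditions are open.
import org.apache.hadoop.util.NativeCodeLoader;

public class FastPathPlatformCheck {
  static boolean isFastPathSupported() {
    // libhadoop carries the JNI socket code, so it must be loadable, and
    // descriptor passing is only exercised on Linux in the first version.
    return NativeCodeLoader.isNativeCodeLoaded()
        && System.getProperty("os.name").toLowerCase().startsWith("linux");
  }
}
{code}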
h2. Work already completed
h3. Initial experimentation
The early work in HDFS-347 indicated that the performance improvements of this
patch will be substantial. The experimentation modified the BlockReader to
"cheat" and simply open the stored blocks with standard file APIs, which had
been chmodded world-readable. This improved the time to read a 1GB file from 8.7 seconds to
5.3 seconds, and improved random IO performance by a factor of more than 30.
h3. Local Sockets and JNI Library
I have already ported the local sockets JNI code from the Android project into
a local branch of the Hadoop code base, and written simple unit tests to verify
its operation. The JNI code compiles as part of libhadoop, and the Java side
uses the existing NativeCodeLoader class. These patches will become part of the
Common project.
h3. DFSInputStream refactoring
To aid in testing and understanding of the code, I have refactored
DFSInputStream to be a standalone class instead of an inner class of DFSClient.
Additionally, I have converted BlockReader to an interface and renamed
BlockReader to RemoteBlockReader. In the process I also refactored the contents
of DFSInputStream to clarify the failure and retry semantics. This work should
be migrated to another JIRA as mentioned above.
h3. Fast path initiation and basic operation
I have implemented the algorithm as described above and added new unit tests to
verify operation. Basic unit tests are currently passing using the fast path
reads.
h2. Work Remaining / Open Questions
h3. Checksums
The current implementation of LocalBlockReader does not verify checksums. Thus,
some unit tests are not passing. Some refactoring will probably need to be done
to share the checksum verification code between LocalBlockReader and
RemoteBlockReader.
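For illustration, the shared helper could be as simple as the following; it assumes the existing .meta file layout of CRC32 checksums over fixed-size chunks (512 bytes by default) and omits parsing of the meta-file header:
{code:java}
// Simplified verification loop a shared helper could expose to both readers.
import java.io.IOException;
import java.util.zip.CRC32;

public class ChunkChecksumSketch {
  /** Verify one chunk of block data against its stored 4-byte CRC32 value. */
  static void verifyChunk(byte[] data, int off, int len, int storedCrc)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    if ((int) crc.getValue() != storedCrc) {
      throw new IOException("Checksum error in chunk at offset " + off);
    }
  }
}
{code}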
h3. IOException handling
Given that the reads now occur directly in the client, we should
investigate whether we need to add any mechanism for the client to report
errors back to the DFS. The client can still report checksum errors in the
existing mechanism, but we may need to add some method by which it can report
IO Errors (e.g. due to a failing volume). I do not know the current state of
volume error tracking in the datanode; some guidance here would be appreciated.
h3. Interaction with other features (e.g. Append)
We should investigate whether (and how) this feature will interact with other
ongoing work, in particular appends. If there is any complication, it should be
straightforward to simply disable the fast path for any blocks currently under
construction. Given that the primary benefit of the fast path is in MapReduce
jobs, and MapReduce jobs rarely read under-construction blocks, this seems
reasonable and avoids a lot of complexity.
h3. Timeouts
Currently, the JNI library has some TODO markings for implementation of
timeouts on various socket operations. These will need to be implemented for
proper operation.
h3. Benchmarking
Given that this is a performance patch, benchmarks of the final implementation
should be done, covering both random and sequential IO.
h3. Statistics/metrics tracking
Currently, the datanode maintains metrics about the number of bytes read and
written. With this change we will no longer have accurate information unless the
client reports back. Alternatively, the datanode can use the "length"
parameter of OP_CONNECT_UNIX and assume that the client will always read the
entirety of the data it has requested. This is not a strictly accurate
assumption, but the approximation may be fine.
h3. Audit Logs/ClientTrace
Once the DN has sent a file descriptor for a block to the client, it is
impossible to audit the byte offsets that are read. It is possible for a client
to request read access to a small byte range of a block, receive the file
descriptors, and then proceed to read the entire block. We should investigate
whether there is a requirement for byte-range granularity on audit logs and
come up with possible solutions (e.g. disabling the fast path for
non-whole-block reads).
h3. File Descriptors held by Zombie Processes
In practice on some clusters, DFSClient processes can stick around as zombie
processes. In the TCP-based DFSClient, these zombie connections are eventually
timed out by the DN server. In this proposed JIRA, the file descriptors would
already have been transferred, and would thus be stuck open in the zombie
process. This will not block file deletion, but it does prevent the underlying
file system from reclaiming the deleted blocks' space. This may cause problems on HDFS instances with a lot of
block churn and a bad zombie problem. Dhruba can possibly elaborate here.
h3. Determining local IPs
In order to determine when to attempt the fast path, the DFSClient needs to
know when it is connecting to a local datanode. This will rarely be a loopback
IP address, so we need some way of determining which IPs are actually local.
This will probably necessitate an additional method or two in NetUtils in order
to inspect the local interface list, with some caching behavior.
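A possible sketch of that helper, using standard interface enumeration with a crude cache (the class name is a placeholder; the real change would likely land in NetUtils):
{code:java}
// Placeholder class; the real helper would likely be a method on NetUtils.
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class LocalAddressCache {
  // The interface list rarely changes over the life of a client, so the
  // result of the first enumeration is cached.
  private static volatile Set<InetAddress> localAddrs;

  static boolean isLocalAddress(InetAddress addr) {
    Set<InetAddress> addrs = localAddrs;
    if (addrs == null) {
      addrs = new HashSet<InetAddress>();
      try {
        for (NetworkInterface nif
            : Collections.list(NetworkInterface.getNetworkInterfaces())) {
          addrs.addAll(Collections.list(nif.getInetAddresses()));
        }
      } catch (SocketException e) {
        // If enumeration fails, fall back to matching loopback only.
      }
      localAddrs = addrs;
    }
    return addr.isLoopbackAddress() || addrs.contains(addr);
  }
}
{code}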
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
> Key: HDFS-347
> URL: https://issues.apache.org/jira/browse/HDFS-347
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: George Porter
> Attachments: HADOOP-4801.1.patch, HADOOP-4801.2.patch,
> HADOOP-4801.3.patch, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to
> move the code to the data. However, putting the DFS client on the same
> physical node as the data blocks it acts on doesn't improve read performance
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem
> is due to the HDFS streaming protocol causing many more read I/O operations
> (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB
> disk block from the DataNode process (running in a separate JVM) running on
> the same machine. The DataNode will satisfy the single disk block request by
> sending data back to the HDFS client in 64-KB chunks. In BlockSender.java,
> this is done in the sendChunk() method, relying on Java's transferTo()
> method. Depending on the host O/S and JVM implementation, transferTo() is
> implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk by issuing a separate I/O
> operation for each chunk. The result is that the single request for a 64-MB
> block ends up hitting the disk as over a thousand smaller requests of 64 KB
> each.
> Since the DFSClient runs in a different JVM and process than the DataNode,
> shuttling data from the disk to the DFSClient also results in context
> switches each time network packets get sent (in this case, each 64-KB chunk
> turns into a large number of 1500-byte packet send operations). Thus we see
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think
> the answer is to provide a mechanism for a DFSClient to directly open data
> blocks that happen to be on the same machine. It could do this by examining the set of
> LocatedBlocks returned by the NameNode, marking those that should be resident
> on the local host. Since the DataNode and DFSClient (probably) share the
> same hadoop configuration, the DFSClient should be able to find the files
> holding the block data, and it could directly open them and send data back to
> the client. This would avoid the context switches imposed by the network
> layer, and would allow for much larger read buffers than 64KB, which should
> reduce the number of iops imposed by each read block operation.