[
https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013719#comment-13013719
]
Nathan Roberts commented on HDFS-347:
-------------------------------------
We have been looking closely at the capability introduced in this Jira because
the initial results look very promising. However, after looking deeper, I’m not
convinced this is an approach that makes the most sense at this time. This Jira
is all about getting the maximum performance when the blocks of a file are on
the local node. Obviously performance of this use case is a critical piece of
“move computation to data”. However, if going through the datanode were to
offer the same level of performance as going direct at the files, then this
Jira wouldn’t even exist. So, I think it’s really important for us to
understand the performance benefits of going direct and the real root causes of
any performance differences between going direct and having the data flow
through the datanode. Once that is well understood, then I think we could look
at the value proposition of this change. We’ve tried to do some of this
analysis and the results follow. Key word here is “some”. I feel we’ve gathered
enough data to draw some valuable conclusions, but I don’t think it’s enough
data to say this type of approach wouldn’t be worth pursuing down the road.
For the impatient, the paragraphs below can be summarized with the following
points:
+ Going through the datanode maintains architectural layering. All other things
being equal, it would be best to avoid exposing the internal details of how the
datanode maintains its data. Violations of this layering could paint us into a
corner down the road and therefore should be avoided.
+ Benchmarked localhost sockets at 650MB/sec (write->read) and
1.6GB/sec (sendfile->read). nc uses 1K buffers and this probably explains the
low bandwidth observed as part of this jira.
+ Measured maximum client ingest rate at 280MB/sec for sockets. Checksum
calculation seems to account for a big part of this limit.
+ Measured maximum datanode streaming output rate of 827MB/sec.
+ Measured maximum datanode random read output rate of 221MB/sec (with
hdfs-941).
+ The maximum client ingest rate of 280MB/sec is significantly slower than the
maximum datanode streaming output rate of 827MB/sec and only marginally faster
than the maximum datanode random read output rate of 221MB/sec. This seems to
say that with the current bottlenecks, there isn’t a ton of performance to be
gained from going direct, at least not for the simple test cases used here.
For the detail oriented, keep reading.
If everything in the system were optimized, then going direct would certainly
have a performance advantage (fewer layers means higher top-end performance).
However, the questions are:
+ How much of a performance gain?
+ Can this gain be realized with existing use cases?
+ Is the gain worth the layering violations? For example, what if we decided to
automatically merge small blocks into single files? In order to access this
data directly, both the datanode and the client-side code would have to be
cognizant of this format. Or what if we wanted to support encrypted content?
Or if we wanted to handle I/O errors differently than they’re handled today?
I’m sure there are others I’m not thinking of.
Ok, now for some data.
One of the initial comments talked about the overhead of localhost network
connections. The comment used nc to measure bandwidth through a socket vs
bandwidth through a pipe. We looked into this a little because it was a bit
surprising. Sure enough, on my RHEL5 system I saw pretty much the same numbers.
Digging deeper, nc uses a 1K buffer on RHEL5, which can’t be good for
throughput. So, we ran lmbench on the same system to see what sort of results
we would get. localhost sockets and pipes both came in right around 660MB/sec
with 64K blocksizes. Pipes will probably scale up a bit better across more
cores, but I would not expect to see the 5x difference that the original nc
experiment showed. We also modified lmbench to use sendfile() instead of
write() in the local socket test and measured this throughput to be 1.6GB/sec.
CONCLUSION: A localhost socket should be able to move around 650MB/sec for
write->read, and 1.6GB/sec for sendfile->read.
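(For anyone who wants to sanity-check the buffer-size effect without lmbench, a
small loopback test along the lines of the sketch below is enough; the class
name, byte count, and buffer sizes are arbitrary choices, and a sendfile-style
run would replace the write loop with FileChannel.transferTo() from a
pre-filled file to a SocketChannel.)

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Rough localhost socket throughput check: pushes TOTAL_BYTES through a
 * loopback socket with a given buffer size and reports MB/sec. Buffer sizes
 * and totals are arbitrary; this is a sketch, not a substitute for lmbench.
 */
public class LoopbackThroughput {
  private static final long TOTAL_BYTES = 1L << 30; // 1 GB per run

  public static void main(String[] args) throws Exception {
    for (int bufSize : new int[] {1024, 64 * 1024}) { // nc-like 1K vs 64K
      System.out.printf("buffer=%dB -> %.0f MB/sec%n", bufSize, run(bufSize));
    }
  }

  private static double run(int bufSize) throws Exception {
    ServerSocket server = new ServerSocket();
    server.bind(new InetSocketAddress("127.0.0.1", 0));

    // Reader side: drain everything the writer sends until EOF.
    Thread reader = new Thread(() -> {
      try (Socket s = server.accept(); InputStream in = s.getInputStream()) {
        byte[] buf = new byte[bufSize];
        while (in.read(buf) >= 0) { /* discard */ }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    reader.start();

    long start = System.nanoTime();
    try (Socket s = new Socket("127.0.0.1", server.getLocalPort());
         OutputStream out = s.getOutputStream()) {
      byte[] buf = new byte[bufSize];
      for (long sent = 0; sent < TOTAL_BYTES; sent += bufSize) {
        out.write(buf);
      }
    }
    reader.join();
    server.close();

    double seconds = (System.nanoTime() - start) / 1e9;
    return (TOTAL_BYTES / (1024.0 * 1024.0)) / seconds;
  }
}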
The remaining results involve HDFS. In these tests the blocks being read are
all in the kernel page cache. This was done to completely remove disk seek
latencies from the equation and to highlight any datanode overheads.
io.file.buffer.size was 64K in all tests. (Todd measured a 30% improvement
using the direct method with checksums enabled. I can’t completely reconcile
that improvement with the results below, but I’m wondering if it’s due to that
test using the default of 4K buffers? I think the results of that test would
be consistent with the results below if that were the case. In any event it
would be good to reconcile the differences at some point.)
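(For reference, pinning the 64K buffer size in a test client looks roughly like
the sketch below; the class name and path are placeholders, and the per-stream
override in FileSystem.open() is shown only to make the buffer size explicit.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: force 64K read buffers instead of the 4K default.
// The path below is a placeholder.
public class BufferSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 64 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // The buffer size can also be passed explicitly per stream:
    FSDataInputStream in = fs.open(new Path("/bench/testfile"), 64 * 1024);
    byte[] buf = new byte[64 * 1024];
    while (in.read(buf) != -1) {
      // discard; we only care that reads use the intended buffer size
    }
    in.close();
  }
}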
The next piece of data we wanted was the maximum rate at which the client can
ingest data. The first thing we did was run a simple streaming read. In this
case we saw about 280MB/sec. This is nowhere near 1.6GB/sec, so the bottleneck
must be the client and/or the server (i.e. it’s not the pipe). The client
process was at 100% CPU, so it’s probably there. To verify, we disabled
checksum verification on the client and this number went up to 776MB/sec, with
client CPU utilization still at 100%. The bottleneck appears to still be at
the client, most likely because the client has to actually copy the data out
of the kernel while the datanode uses sendfile.
CONCLUSION: Maximum client ingest rate for a stream is around 280MB/sec.
Datanode is capable of streaming out at least 776MB/sec. Given current client
code, there would not be a significant advantage to going direct to the file
because checksum calculation and other client overheads limit its ingestion
rate to 285MB/sec and the datanode is easily capable of sustaining this rate
for streaming reads.
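(A driver roughly like the sketch below is enough to reproduce this kind of
streaming measurement; the class name and arguments are placeholders, and
FileSystem.setVerifyChecksum() is one way to toggle client-side checksum
verification for the no-checksum runs.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a streaming-read throughput test.
 * Usage: StreamingReadBench <path> <verifyChecksum>
 * The file is assumed to be cached on the local datanode, as in the tests
 * above.
 */
public class StreamingReadBench {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    boolean verifyChecksum = Boolean.parseBoolean(args[1]);

    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 64 * 1024);
    FileSystem fs = FileSystem.get(conf);
    fs.setVerifyChecksum(verifyChecksum);  // client-side checksum toggle

    byte[] buf = new byte[64 * 1024];
    long total = 0;
    long start = System.nanoTime();
    FSDataInputStream in = fs.open(path);
    int n;
    while ((n = in.read(buf)) != -1) {
      total += n;
    }
    in.close();

    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("streamed %d bytes at %.0f MB/sec (verifyChecksum=%b)%n",
        total, (total / (1024.0 * 1024.0)) / secs, verifyChecksum);
  }
}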
The next thing we wanted to look at was random I/O. There is a lot more
overhead on the datanode for this particular use case, so this could be a
place where direct access could really excel. The first thing we did here was
run a simple random read test to again measure the maximum read throughput. In
this case we measured 105MB/sec. Again we tried to eliminate the bottlenecks.
However, it’s more complicated in the random read case because it is a
request/response type of protocol. So, first we focused on the datanode.
HDFS-941 is a proposed change which helps the pread use case significantly.
The implementation in HDFS-941 seems very reasonable and looks to be wrapping
up very soon. So, we applied the HDFS-941 patch and this improved the
throughput to 143MB/sec.
This isn’t at the 285MB/sec client limit yet, so it’s still conceivable that
going direct could add a nice boost.
Since this is a request/response protocol, the checksum processing on the
client will impact the overall throughput of random I/O use cases. With
checksums disabled, the random I/O throughput increased from 143MB/sec to
221MB/sec.
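(The random-read numbers come from positioned reads (pread); the access pattern
looks roughly like the sketch below, where the class name, read size, and
iteration count are placeholders and readFully(position, ...) is the
positioned-read call on FSDataInputStream.)

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a random positioned-read (pread) test against a cached file.
 * Usage: RandomReadBench <path>
 */
public class RandomReadBench {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    final int readSize = 64 * 1024;
    final int iterations = 100000;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long fileLen = fs.getFileStatus(path).getLen();

    byte[] buf = new byte[readSize];
    Random rand = new Random(0);
    FSDataInputStream in = fs.open(path);

    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      long pos = (long) (rand.nextDouble() * (fileLen - readSize));
      in.readFully(pos, buf, 0, readSize);  // pread: no seek on the stream
    }
    in.close();

    double secs = (System.nanoTime() - start) / 1e9;
    double mb = (double) iterations * readSize / (1024.0 * 1024.0);
    System.out.printf("pread throughput: %.0f MB/sec%n", mb / secs);
  }
}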
CONCLUSION: A localhost socket maxes out at around 1.6GB/sec; we measured
827MB/sec for no-checksum streaming reads. The datanode is currently not
capable of maxing out a localhost socket.
CONCLUSION: Clients can currently ingest about 280MB/sec. This rate is easily
reached with streaming reads. For random reads with HDFS-941, the client’s
maximum rate is a bit higher than what the datanode can currently deliver for
random reads (280MB/sec vs 221MB/sec), but not dramatically so. Therefore, for
today the right approach seems to be to enhance the datanode to make sure the
bottleneck is squarely at the client. Since the bottleneck is mainly due to
checksum calculation and data copies out of the kernel, going direct to a
block file shouldn’t have a significant impact because both of these overhead
activities need to be performed whether going direct or not.
The results above are all in terms of single-reader throughput of cached
block files. More scalability testing needs to be performed. We did verify on
a dual quad-core system that the datanode could scale its random read
throughput from 137MB/sec to 480MB/sec with 4 readers. This was enough load to
saturate 5 of the 8 cores, with the clients consuming 3 and the datanode
consuming 2. It’s just one data point; there’s lots more work to be done in
the area of datanode scalability.
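(The multi-reader numbers were gathered by simply running several readers
concurrently; whether they are separate processes or threads in one JVM is a
detail. A skeleton for the threaded variant is sketched below, with
readOneClient() standing in for one of the read loops above.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Skeleton for an N-reader concurrency test. readOneClient() is a stub
 * standing in for one of the read loops sketched earlier; it should return
 * the number of bytes it read.
 */
public class ConcurrentReadBench {
  public static void main(String[] args) throws Exception {
    final int readers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(readers);

    long start = System.nanoTime();
    List<Future<Long>> results = new ArrayList<Future<Long>>();
    for (int i = 0; i < readers; i++) {
      results.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          return readOneClient();  // stub: per-reader streaming or pread loop
        }
      }));
    }

    long totalBytes = 0;
    for (Future<Long> f : results) {
      totalBytes += f.get();       // wait for every reader to finish
    }
    pool.shutdown();

    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("aggregate: %.0f MB/sec across %d readers%n",
        (totalBytes / (1024.0 * 1024.0)) / secs, readers);
  }

  // Stub: replace with one of the earlier read loops.
  private static long readOneClient() {
    return 0;
  }
}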
Latency is also a critical attribute of the datanode and some more data needs
to be gathered in this area. However, I propose we focus on fixing any
contention/latency issues within the datanode prior to diving into a direct I/O
sort of approach (there are already a few jiras out there aimed at improving
concurrency within the datanode). If we can’t get anywhere
near the latency requirements, then at that point we should consider more
efficient ways of getting at the data.
Thanks to Kihwal Lee and Dave Thompson for doing a significant amount of data
gathering! Gathering this type of data always seems to take longer than one
would think, so thank you for the efforts.
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
> Key: HDFS-347
> URL: https://issues.apache.org/jira/browse/HDFS-347
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: George Porter
> Assignee: Todd Lipcon
> Attachments: BlockReaderLocal1.txt, HADOOP-4801.1.patch,
> HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-branch-20-append.txt,
> all.tsv, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to
> move the code to the data. However, putting the DFS client on the same
> physical node as the data blocks it acts on doesn't improve read performance
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem
> is due to the HDFS streaming protocol causing many more read I/O operations
> (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB
> disk block from the DataNode process (running in a separate JVM) on the
> same machine. The DataNode will satisfy the single disk block request by
> sending data back to the HDFS client in 64-KB chunks. In BlockSender.java,
> this is done in the sendChunk() method, relying on Java's transferTo()
> method. Depending on the host O/S and JVM implementation, transferTo() is
> implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk with a separate I/O
> operation. The result is that the single request for a 64-MB block ends up
> hitting the disk as over a thousand smaller requests of 64 KB each.
> Since the DFSClient runs in a different JVM and process than the DataNode,
> shuttling data from the disk to the DFSClient also results in context
> switches each time network packets get sent (in this case, the 64-KB chunk
> turns into a large number of 1500-byte packet send operations). Thus we see
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think
> the answer is to provide a mechanism for a DFSClient to directly open data
> blocks that happen to be on the same machine. It could do this by examining
> the set of
> LocatedBlocks returned by the NameNode, marking those that should be resident
> on the local host. Since the DataNode and DFSClient (probably) share the
> same Hadoop configuration, the DFSClient should be able to find the files
> holding the block data, and it could directly open them and send data back to
> the client. This would avoid the context switches imposed by the network
> layer, and would allow for much larger read buffers than 64KB, which should
> reduce the number of iops imposed by each read block operation.
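(To make the chunking concrete: the transferTo() pattern described above looks
roughly like the simplified sketch below; this is an illustration, not the
actual BlockSender code.)

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

/**
 * Simplified illustration of the chunked transferTo() pattern described
 * above (not the actual BlockSender code): a block file is pushed to a
 * socket channel in fixed-size chunks, so a 64 MB block turns into roughly
 * a thousand 64 KB transfers.
 */
public class ChunkedSendSketch {
  private static final int CHUNK_SIZE = 64 * 1024;

  public static void sendBlock(FileChannel blockFile, WritableByteChannel socket)
      throws IOException {
    long pos = 0;
    long remaining = blockFile.size();
    while (remaining > 0) {
      // Each call maps to sendfile()/mmap+write under the covers and issues
      // its own I/O for the chunk when the data is not already cached.
      long sent = blockFile.transferTo(pos, Math.min(CHUNK_SIZE, remaining), socket);
      pos += sent;
      remaining -= sent;
    }
  }
}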
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira