[
https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013719#comment-13013719
]
Nathan Roberts commented on HDFS-347:
-------------------------------------
We have been looking closely at the capability introduced in this Jira because
the initial results look very promising. However, after looking deeper, I’m not
convinced this is an approach that makes the most sense at this time. This Jira
is all about getting the maximum performance when the blocks of a file are on
the local node. Obviously performance of this use case is a critical piece of
“move computation to data”. However, if going through the datanode were to
offer the same level of performance as going direct at the files, then this
Jira wouldn’t even exist. So, I think it’s really important for us to
understand the performance benefits of going direct and the real root causes of
any performance differences between going direct and having the data flow
through the datanode. Once that is well understood, then I think we could look
at the value proposition of this change. We’ve tried to do some of this
analysis and the results follow. Key word here is “some”. I feel we’ve gathered
enough data to draw some valuable conclusions, but I don’t think it’s enough
data to say this type of approach wouldn’t be worth pursuing down the road.
For the impatient, the paragraphs below can be summarized with the following
points:
+ Going through the datanode maintains architectural layering. All other things
being equal, it would be best to avoid exposing the internal details of how the
datanode maintains its data. Violations of this layering could paint us into a
corner down the road and therefore should be avoided.
+ Benchmarked localhost sockets at 650MB/sec (write->read) and
1.6GB/sec (sendfile->read). nc uses 1K buffers and this probably explains the
low bandwidth observed as part of this jira.
+ Measured maximum client ingest rate at 280MB/sec for sockets. Checksum
calculation seems to account for a big part of this limit.
+ Measured maximum datanode streaming output rate of 827MB/sec.
+ Measured maximum datanode random read output rate of 221MB/sec (with
hdfs-941).
+ The maximum client ingest rate of 280MB/sec is significantly slower than the
maximum datanode streaming output rate of 827MB/sec and only marginally faster
than the maximum datanode random read output rate of 221MB/sec. This seems to
say that with the current bottlenecks, there isn’t a ton of performance to be
gained from going direct, at least not for the simple test cases used here.
For the detail oriented, keep reading.
If everything in the system were optimized, then going direct would certainly
have a performance advantage (fewer layers means higher top-end performance).
However, the questions are:
+ How much of a performance gain?
+ Can this gain be realized with existing use cases?
+ Is the gain worth the layering violations? For example, what if we decided to
automatically merge small blocks into single files? In order to access this
data directly, both the datanode and the client-side code would have to be
cognizant of this format. Or what if we wanted to support encrypted content?
Or if we wanted to handle I/O errors differently than they’re handled today?
I’m sure there are others I’m not thinking of.
Ok, now for some data.
One of the initial comments talked about the overhead of localhost network
connections. The comment used nc to measure bandwidth through a socket vs
bandwidth through a pipe. We looked into this a little because it was a bit
surprising. Sure enough, on my RHEL5 system I saw pretty much the same numbers.
Digging deeper, nc uses a 1K buffer on RHEL5, which can’t be good for
throughput. So, we ran lmbench on the same system to see what sort of results
we would get. localhost sockets and pipes both came in right around 660MB/sec
with 64K blocksizes. Pipes will probably scale up a bit better across more
cores, but I would not expect to see the 5x difference that the original nc
experiment showed. We also modified lmbench to use sendfile() instead of
write() in the local socket test and measured this throughput to be 1.6GB/sec.
CONCLUSION: A localhost socket should be able to move around 650MB/sec for
write->read, and 1.6GB/sec for sendfile->read.
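(For anyone who wants to sanity-check the buffer-size effect without lmbench, a
small loopback test along the lines of the sketch below is enough; the class
name, byte count, and buffer sizes are arbitrary choices, and a sendfile-style
run would replace the write loop with FileChannel.transferTo() from a
pre-filled file to a SocketChannel.)

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Rough localhost socket throughput check: pushes TOTAL_BYTES through a
 * loopback socket with a given buffer size and reports MB/sec. Buffer sizes
 * and totals are arbitrary; this is a sketch, not a substitute for lmbench.
 */
public class LoopbackThroughput {
  private static final long TOTAL_BYTES = 1L << 30; // 1 GB per run

  public static void main(String[] args) throws Exception {
    for (int bufSize : new int[] {1024, 64 * 1024}) { // nc-like 1K vs 64K
      System.out.printf("buffer=%dB -> %.0f MB/sec%n", bufSize, run(bufSize));
    }
  }

  private static double run(int bufSize) throws Exception {
    ServerSocket server = new ServerSocket();
    server.bind(new InetSocketAddress("127.0.0.1", 0));

    // Reader side: drain everything the writer sends until EOF.
    Thread reader = new Thread(() -> {
      try (Socket s = server.accept(); InputStream in = s.getInputStream()) {
        byte[] buf = new byte[bufSize];
        while (in.read(buf) >= 0) { /* discard */ }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    reader.start();

    long start = System.nanoTime();
    try (Socket s = new Socket("127.0.0.1", server.getLocalPort());
         OutputStream out = s.getOutputStream()) {
      byte[] buf = new byte[bufSize];
      for (long sent = 0; sent < TOTAL_BYTES; sent += bufSize) {
        out.write(buf);
      }
    }
    reader.join();
    server.close();

    double seconds = (System.nanoTime() - start) / 1e9;
    return (TOTAL_BYTES / (1024.0 * 1024.0)) / seconds;
  }
}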
The remaining results involve HDFS. In these tests the blocks being read are
all in the kernel page cache. This was done to completely remove disk seek
latencies from the equation and to highlight any datanode overheads.
io.file.buffer.size was 64K in all tests. (Todd measured a 30% improvement
using the direct method with checksums enabled. I can’t completely reconcile
that improvement with the results below, but I’m wondering if it’s due to that
test using the default of 4K buffers? I think the results of that test would
be consistent with the results below if that were the case. In any event it
would be good to reconcile the differences at some point.)
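(For reference, pinning the 64K buffer size in a test client looks roughly like
the sketch below; the class name and path are placeholders, and the per-stream
override in FileSystem.open() is shown only to make the buffer size explicit.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: force 64K read buffers instead of the 4K default.
// The path below is a placeholder.
public class BufferSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 64 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // The buffer size can also be passed explicitly per stream:
    FSDataInputStream in = fs.open(new Path("/bench/testfile"), 64 * 1024);
    byte[] buf = new byte[64 * 1024];
    while (in.read(buf) != -1) {
      // discard; we only care that reads use the intended buffer size
    }
    in.close();
  }
}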
The next piece of data we wanted was the maximum rate at which the client can
ingest data. The first thing we did was run a simple streaming read. In this
case we saw about 280MB/sec. This is nowhere near 1.6GB/sec, so the bottleneck
must be the client and/or the server (i.e. it’s not the pipe). The client
process was at 100% CPU, so it’s probably there. To verify, we disabled
checksum verification on the client and this number went up to 776MB/sec, with
client CPU utilization still at 100%. The bottleneck appears to still be at
the client, most likely because the client has to actually copy the data out
of the kernel while the datanode uses sendfile.
CONCLUSION: Maximum client ingest rate for a stream is around 280MB/sec.
Datanode is capable of streaming out at least 776MB/sec. Given current client
code, there would not be a significant advantage to going direct to the file
because checksum calculation and other client overheads limit its ingestion
rate to 285MB/sec and the datanode is easily capable of sustaining this rate
for streaming reads.
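(A driver roughly like the sketch below is enough to reproduce this kind of
streaming measurement; the class name and arguments are placeholders, and
FileSystem.setVerifyChecksum() is one way to toggle client-side checksum
verification for the no-checksum runs.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a streaming-read throughput test.
 * Usage: StreamingReadBench <path> <verifyChecksum>
 * The file is assumed to be cached on the local datanode, as in the tests
 * above.
 */
public class StreamingReadBench {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    boolean verifyChecksum = Boolean.parseBoolean(args[1]);

    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 64 * 1024);
    FileSystem fs = FileSystem.get(conf);
    fs.setVerifyChecksum(verifyChecksum);  // client-side checksum toggle

    byte[] buf = new byte[64 * 1024];
    long total = 0;
    long start = System.nanoTime();
    FSDataInputStream in = fs.open(path);
    int n;
    while ((n = in.read(buf)) != -1) {
      total += n;
    }
    in.close();

    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("streamed %d bytes at %.0f MB/sec (verifyChecksum=%b)%n",
        total, (total / (1024.0 * 1024.0)) / secs, verifyChecksum);
  }
}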
The next thing we wanted to look at was random I/O. There is a lot more
overhead on the datanode for this particular use case, so this could be a
place where direct access could really excel. The first thing we did here was
run a simple random read test to again measure the maximum read throughput. In
this case we measured 105MB/sec. Again we tried to eliminate the bottlenecks.
However, it’s more complicated in the random read case because it is a
request/response type of protocol. So, first we focused on the datanode.
HDFS-941 is a proposed change which helps the pread use case significantly.
The implementation in HDFS-941 seems very reasonable and looks to be wrapping
up very soon. So, we applied the HDFS-941 patch and this improved the
throughput to 143MB/sec.
This isn’t at the 285MB/sec client limit yet, so it’s still conceivable that
going direct could add a nice boost.
Since this is a request/response protocol, the checksum processing on the
client will impact the overall throughput of random I/O use cases. With
checksums disabled, the random I/O throughput increased from 143MB/sec to
221MB/sec.
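(The random-read numbers come from positioned reads (pread); the access pattern
looks roughly like the sketch below, where the class name, read size, and
iteration count are placeholders and readFully(position, ...) is the
positioned-read call on FSDataInputStream.)

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a random positioned-read (pread) test against a cached file.
 * Usage: RandomReadBench <path>
 */
public class RandomReadBench {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    final int readSize = 64 * 1024;
    final int iterations = 100000;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long fileLen = fs.getFileStatus(path).getLen();

    byte[] buf = new byte[readSize];
    Random rand = new Random(0);
    FSDataInputStream in = fs.open(path);

    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      long pos = (long) (rand.nextDouble() * (fileLen - readSize));
      in.readFully(pos, buf, 0, readSize);  // pread: no seek on the stream
    }
    in.close();

    double secs = (System.nanoTime() - start) / 1e9;
    double mb = (double) iterations * readSize / (1024.0 * 1024.0);
    System.out.printf("pread throughput: %.0f MB/sec%n", mb / secs);
  }
}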
CONCLUSION: A localhost socket maxes out at around 1.6GB/sec; we measured
827MB/sec for no-checksum streaming reads. The datanode is currently not
capable of maxing out a localhost socket.
CONCLUSION: Clients can currently ingest about 280MB/sec. This rate is easily
reached with streaming reads. For random reads with HDFS-941, the client’s
maximum rate is a bit higher than what the datanode can currently deliver for
random reads (280MB/sec vs 221MB/sec), but not dramatically so. Therefore, for
today the right approach seems to be to enhance the datanode to make sure the
bottleneck is squarely at the client. Since the bottleneck is mainly due to
checksum calculation and data copies out of the kernel, going direct to a
block file shouldn’t have a significant impact because both of these overhead
activities need to be performed whether going direct or not.
The results above are all in terms of single-reader throughput of cached
block files. More scalability testing needs to be performed. We did verify on
a dual quad-core system that the datanode could scale its random read
throughput from 137MB/sec to 480MB/sec with 4 readers. This was enough load to
saturate 5 of the 8 cores, with the clients consuming 3 and the datanode
consuming 2. It’s just one data point; there’s lots more work to be done in
the area of datanode scalability.
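(The multi-reader numbers were gathered by simply running several readers
concurrently; whether they are separate processes or threads in one JVM is a
detail. A skeleton for the threaded variant is sketched below, with
readOneClient() standing in for one of the read loops above.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Skeleton for an N-reader concurrency test. readOneClient() is a stub
 * standing in for one of the read loops sketched earlier; it should return
 * the number of bytes it read.
 */
public class ConcurrentReadBench {
  public static void main(String[] args) throws Exception {
    final int readers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(readers);

    long start = System.nanoTime();
    List<Future<Long>> results = new ArrayList<Future<Long>>();
    for (int i = 0; i < readers; i++) {
      results.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          return readOneClient();  // stub: per-reader streaming or pread loop
        }
      }));
    }

    long totalBytes = 0;
    for (Future<Long> f : results) {
      totalBytes += f.get();       // wait for every reader to finish
    }
    pool.shutdown();

    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("aggregate: %.0f MB/sec across %d readers%n",
        (totalBytes / (1024.0 * 1024.0)) / secs, readers);
  }

  // Stub: replace with one of the earlier read loops.
  private static long readOneClient() {
    return 0;
  }
}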
Latency is also a critical attribute of the datanode and some more data needs
to be gathered in this area. However, I propose we focus on fixing any
contention/latency issues within the datanode prior to diving into a direct I/O
sort of approach (there are already a few jiras out there aimed at improving
concurrency within the datanode). If we can’t get anywhere
near the latency requirements, then at that point we should consider more
efficient ways of getting at the data.
Thanks to Kihwal Lee and Dave Thompson for doing a significant amount of data
gathering! Gathering this type of data always seems to take longer than one
would think, so thank you for the efforts.
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
> Key: HDFS-347
> URL: https://issues.apache.org/jira/browse/HDFS-347
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: George Porter
> Assignee: Todd Lipcon
> Attachments: BlockReaderLocal1.txt, HADOOP-4801.1.patch,
> HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-branch-20-append.txt,
> all.tsv, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to
> move the code to the data. However, putting the DFS client on the same
> physical node as the data blocks it acts on doesn't improve read performance
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem
> is due to the HDFS streaming protocol causing many more read I/O operations
> (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB
> disk block from the DataNode process (running in a separate JVM) on the
> same machine. The DataNode will satisfy the single disk block request by
> sending data back to the HDFS client in 64-KB chunks. In BlockSender.java,
> this is done in the sendChunk() method, relying on Java's transferTo()
> method. Depending on the host O/S and JVM implementation, transferTo() is
> implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk with a separate I/O
> operation. The result is that the single request for a 64-MB block ends up
> hitting the disk as over a thousand smaller requests of 64 KB each.
> Since the DFSClient runs in a different JVM and process than the DataNode,
> shuttling data from the disk to the DFSClient also results in context
> switches each time network packets get sent (in this case, the 64-KB chunk
> turns into a large number of 1500-byte packet send operations). Thus we see
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think
> the answer is to provide a mechanism for a DFSClient to directly open data
> blocks that happen to be on the same machine. It could do this by examining
> the set of
> LocatedBlocks returned by the NameNode, marking those that should be resident
> on the local host. Since the DataNode and DFSClient (probably) share the
> same Hadoop configuration, the DFSClient should be able to find the files
> holding the block data, and it could directly open them and send data back to
> the client. This would avoid the context switches imposed by the network
> layer, and would allow for much larger read buffers than 64KB, which should
> reduce the number of iops imposed by each read block operation.
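(To make the chunking concrete: the transferTo() pattern described above looks
roughly like the simplified sketch below; this is an illustration, not the
actual BlockSender code.)

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

/**
 * Simplified illustration of the chunked transferTo() pattern described
 * above (not the actual BlockSender code): a block file is pushed to a
 * socket channel in fixed-size chunks, so a 64 MB block turns into roughly
 * a thousand 64 KB transfers.
 */
public class ChunkedSendSketch {
  private static final int CHUNK_SIZE = 64 * 1024;

  public static void sendBlock(FileChannel blockFile, WritableByteChannel socket)
      throws IOException {
    long pos = 0;
    long remaining = blockFile.size();
    while (remaining > 0) {
      // Each call maps to sendfile()/mmap+write under the covers and issues
      // its own I/O for the chunk when the data is not already cached.
      long sent = blockFile.transferTo(pos, Math.min(CHUNK_SIZE, remaining), socket);
      pos += sent;
      remaining -= sent;
    }
  }
}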
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira