[
https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-347:
-----------------------------
Attachment: hdfs-347.png
            all.tsv
Took some time to rebase this work against trunk (with HADOOP-5205 and HDFS-755
patched in as well). Here's a graph (and the data that made it) comparing the
following:
- checksumfs.tsv - reading a file:/// URL with an associated checksum file on
my local disk
- raw.tsv - reading the same file, but with no checksum file
- without.tsv - pseudo-distributed HDFS with dfs.client.use.unix.sockets=false
- with.tsv - same HDFS, but with dfs.client.use.unix.sockets=true
For all of these tests, I used a 691MB file, and double-checked the md5sum
output to make sure they were all reading it correctly. Each box plot shows the
distribution of 50 trials of fs -cat /path/to/file. io.file.buffer.size was set
to 64K for all trials.
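For reference, a scaled-down sketch of the timing methodology (not the actual harness — the real runs used a 691MB file, 50 trials, and hadoop fs -cat; the class and file names here are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: stream a file through a 64 KB buffer (mirroring
// io.file.buffer.size=64K) and time repeated trials.
public class ReadBench {
    static final int BUFFER_SIZE = 64 * 1024;

    // Read the whole file through a 64 KB buffer, returning bytes read.
    static long readAll(Path file) throws IOException {
        byte[] buf = new byte[BUFFER_SIZE];
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("readbench", ".dat");
        Files.write(file, new byte[8 * 1024 * 1024]); // 8 MB stand-in
        for (int trial = 0; trial < 5; trial++) {
            long start = System.nanoTime();
            long total = readAll(file);
            System.out.printf("trial %d: %d bytes in %d ms%n",
                    trial, total, (System.nanoTime() - start) / 1_000_000);
        }
        Files.delete(file);
    }
}
```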
The big surprise here is that HDFS with this patch somehow came out faster than
ChecksumFileSystem. The sys time is the same for both, but HDFS uses less CPU
time overall. Since this doesn't make much sense, I reran both the HDFS and
ChecksumFs benchmarks a second time and got the same results.
If anyone cares to wager a guess about how this could be possible, I'd
appreciate it :) Otherwise, I will try to dig into this.
The inclusion of raw shows the same 200-300% difference referenced in earlier
comments on this JIRA. There's no optimization we can make here aside from
speeding up checksumming. The HADOOP-5205/HDFS-755 patches improved this a bit,
but it's still the major difference. As noted above, this patch makes reading
from the local DN perform at least as well as reading from a local checksummed
filesystem (if not inexplicably better).
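To illustrate where that remaining CPU cost comes from: the read path verifies a CRC32 checksum for every bytes-per-checksum-sized chunk of data (512 bytes by default). This is only a sketch of that per-chunk work, with an illustrative class name — the real read path compares each value against the stored checksum file rather than just computing it:

```java
import java.util.zip.CRC32;

// Sketch of per-chunk CRC32 checksumming, the dominant CPU cost noted above.
public class ChecksumCost {
    static final int BYTES_PER_CHECKSUM = 512; // io.bytes.per.checksum default

    // Returns one CRC32 value per 512-byte chunk of data.
    static long[] checksumChunks(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20]; // 1 MB
        long start = System.nanoTime();
        long[] sums = checksumChunks(data);
        System.out.println(sums.length + " chunks checksummed in "
                + (System.nanoTime() - start) / 1000 + " us");
    }
}
```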
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
> Key: HDFS-347
> URL: https://issues.apache.org/jira/browse/HDFS-347
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: George Porter
> Assignee: Todd Lipcon
> Attachments: all.tsv, HADOOP-4801.1.patch, HADOOP-4801.2.patch,
> HADOOP-4801.3.patch, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to
> move the code to the data. However, putting the DFS client on the same
> physical node as the data blocks it acts on doesn't improve read performance
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem
> is due to the HDFS streaming protocol causing many more read I/O operations
> (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB
> disk block from the DataNode process (running in a separate JVM) running on
> the same machine. The DataNode will satisfy the single disk block request by
> sending data back to the HDFS client in 64-KB chunks. In BlockSender.java,
> this is done in the sendChunk() method, relying on Java's transferTo()
> method. Depending on the host O/S and JVM implementation, transferTo() is
> implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk by issuing a separate I/O
> operation for each chunk. The result is that the single request for a 64-MB
> block ends up hitting the disk as over a thousand smaller requests for 64-KB
> each.
> Since the DFSClient runs in a different JVM and process than the DataNode,
> shuttling data from the disk to the DFSClient also results in context
> switches each time network packets get sent (in this case, each 64-KB chunk
> turns into a large number of 1500-byte packet send operations). Thus we see
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think
> the best approach is providing a mechanism for a DFSClient to directly open
> data blocks that happen to be on the same machine. It could do this by
> examining the set of
> LocatedBlocks returned by the NameNode, marking those that should be resident
> on the local host. Since the DataNode and DFSClient (probably) share the
> same hadoop configuration, the DFSClient should be able to find the files
> holding the block data and read them directly. This would avoid the context
> switches imposed by the network
> layer, and would allow for much larger read buffers than 64KB, which should
> reduce the number of iops imposed by each read block operation.
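For illustration, the chunked send path described in the issue can be sketched with plain NIO. This is not the actual BlockSender code — the class and method names are made up — but it shows how a 64 KB transferTo() loop turns one block read into many separate I/O operations (a 64 MB block becomes roughly a thousand calls):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the BlockSender-style send loop: stream a block file to a
// destination channel in 64 KB slices via FileChannel.transferTo().
public class ChunkedSend {
    static final int CHUNK_SIZE = 64 * 1024;

    // Copy src to dst in CHUNK_SIZE slices, returning how many transferTo()
    // calls were issued -- each one a separate I/O operation on the source.
    static long sendChunked(Path src, Path dst) throws IOException {
        long calls = 0;
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                // transferTo may move fewer bytes than requested, so advance
                // by the amount actually transferred.
                pos += in.transferTo(pos, Math.min(CHUNK_SIZE, size - pos), out);
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".dat");
        Path dst = Files.createTempFile("out", ".dat");
        Files.write(src, new byte[1 << 20]); // 1 MB stand-in for a 64 MB block
        System.out.println("transferTo calls: " + sendChunked(src, dst));
        Files.delete(src);
        Files.delete(dst);
    }
}
```

The proposed local-read mechanism would replace this loop with the DFSClient opening the block file itself and reading through an arbitrarily large buffer, eliminating both the per-chunk iops and the cross-process packet sends.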
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.