[
https://issues.apache.org/jira/browse/HDFS-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-755:
-----------------------------
Attachment: hdfs-755.txt
Here's a fairly small patch which uses the support for reading multiple
checksum chunks from HADOOP-3205. I haven't run the full test suite yet, but
got about halfway through and it seems to work - I'll be sure to put it through
full testing before it gets committed. I'll also run this on a cluster and get
TestDFSIO throughput numbers.
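
To illustrate the idea of reading multiple checksum chunks per call, here is a minimal sketch in Python. It is not the patch and not the FSInputChecker/readChunk API: the function names, the use of CRC32, and the 512-byte chunk size are all assumptions made for the example. It only shows why a larger buffer means fewer calls for the same verified data.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def checksum_chunks(data, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Compute one CRC32 per chunk, as a writer would store them."""
    return [
        zlib.crc32(data[i:i + bytes_per_checksum])
        for i in range(0, len(data), bytes_per_checksum)
    ]


def read_chunks(data, sums, offset, max_bytes,
                bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Return verified data starting at a chunk-aligned offset.

    Verifies as many whole chunks as fit in max_bytes (at least one)
    in a single call, instead of one chunk per call.
    """
    if offset % bytes_per_checksum != 0:
        raise ValueError("offset must be chunk-aligned")
    n_chunks = max(1, max_bytes // bytes_per_checksum)
    first = offset // bytes_per_checksum
    out = bytearray()
    for idx in range(first, min(first + n_chunks, len(sums))):
        start = idx * bytes_per_checksum
        chunk = data[start:start + bytes_per_checksum]
        if zlib.crc32(chunk) != sums[idx]:
            raise IOError("checksum mismatch in chunk %d" % idx)
        out += chunk
    return bytes(out)


def read_all(data, sums, buffer_size):
    """Drain the data and count how many read_chunks calls were needed."""
    offset, calls, out = 0, 0, bytearray()
    while offset < len(data):
        piece = read_chunks(data, sums, offset, buffer_size)
        out += piece
        offset += len(piece)
        calls += 1
    return bytes(out), calls
```

With a buffer of one chunk, read_all makes one call per chunk. With a buffer of several chunks it makes proportionally fewer calls and returns the same bytes.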
Performance results look to be in line with what we see in HADOOP-3205.
Benchmark setup:
- I put a 700MB file on a pseudo-distributed HDFS cluster.
- I ran "fs -cat" on this file 30 times without the patch applied and 30 times
with it applied. In both cases I did a couple of cats first to make sure the
file was in the buffer cache. I can run another set of benchmarks that drops
the cache between runs if people would like.
- In both benchmark cases, the patch from HADOOP-3205 was applied. I used a
64K io.file.buffer.size for both the DN and the client.
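
A benchmark loop of this shape could be scripted roughly as follows. This is a sketch under assumptions, not the harness actually used: the helper name time_runs and the file path in the usage comment are hypothetical, and the user/sys figures come from os.times() child-process counters, which are only meaningful on Unix-like systems and only cover the client process.

```python
import os
import subprocess
import time


def time_runs(cmd, runs=30, warmups=2):
    """Run cmd repeatedly; return a list of (wall, user, sys) seconds per run.

    The warm-up runs are discarded; they only serve to pull the file
    into the buffer cache before measurement starts.
    """
    for _ in range(warmups):
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)

    samples = []
    for _ in range(runs):
        before = os.times()
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        wall = time.perf_counter() - start
        after = os.times()
        samples.append((
            wall,
            after.children_user - before.children_user,
            after.children_system - before.children_system,
        ))
    return samples


# Hypothetical usage (the path is made up for illustration):
# samples = time_runs(["hadoop", "fs", "-cat", "/bench/700mb.dat"])
```

Running it once against the unpatched build and once against the patched build gives two sets of 30 samples to compare.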
T-test results (alternative hypothesis = "with patch is faster"):
Wall clock time: p-value = 2.644e-07 -> effectively 100% confidence. 95%
confidence interval of 3.4% speedup
User time: p-value = 1.638e-10 -> effectively 100% confidence. 95% confidence
interval of 3.9% speedup
Sys time: p-value = 0.982, i.e. p = 0.018 for the opposite hypothesis - that is
to say, above 95% confidence that we *slowed down* sys time. The confidence
interval is about 0.7%
The 95% confidence intervals in this benchmark sound less impressive than the
ones in HADOOP-3205 because I used fewer samples.
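
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a one-sided two-sample test in Python. It assumes a Welch-style statistic and approximates the t distribution with a normal distribution, which is only reasonable for around 30 or more samples per group. The tool and exact test variant used for the numbers above are not stated, so this will not reproduce them digit for digit. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_welch(baseline, patched):
    """Test the alternative "patched is faster" (smaller mean time).

    Returns (t, p, speedup):
      t       - Welch t statistic, positive when patched is faster
      p       - one-sided p-value (normal approximation to the t distribution)
      speedup - relative reduction in mean time, e.g. 0.034 for 3.4%
    Requires at least two samples per group and non-zero variance.
    """
    n_b, n_p = len(baseline), len(patched)
    se = math.sqrt(variance(baseline) / n_b + variance(patched) / n_p)
    diff = mean(baseline) - mean(patched)
    t = diff / se
    p = 1.0 - NormalDist().cdf(t)  # small p => evidence patched is faster
    speedup = diff / mean(baseline)
    return t, p, speedup
```

Under this alternative a p-value close to 1 is evidence in the opposite direction, which is how the sys time result reads: 1 - 0.982 = 0.018 for "with patch is slower".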
As to why the sys time slowed down, it's a bit of a mystery. My best guess is
that, since we're now reading from the network sockets in larger chunks, we
occasionally block in the kernel where we used to pretty much always read from
a full buffer. But this isn't too concerning - the wall clock time is what
really matters.
> Read multiple checksum chunks at once in DFSInputStream
> -------------------------------------------------------
>
> Key: HDFS-755
> URL: https://issues.apache.org/jira/browse/HDFS-755
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs client
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-755.txt
>
>
> HADOOP-3205 adds the ability for FSInputChecker subclasses to read multiple
> checksum chunks in a single call to readChunk. This is the HDFS-side use of
> that new feature.