[
https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HADOOP-3205:
--------------------------------
Attachment: hadoop-3205.txt
Here's a patch that implements the FSInputChecker side of this ticket.
Benchmark results are promising. I put a 700MB file in /dev/shm with its
associated checksum and then timed "hadoop fs -cat /dev/shm/bigfile" 100 times
with the patch and without the patch. Here is R output from the analysis of
these times:
{noformat}
> p.user <- read.table(file="/tmp/times.patch.user")
> p.sys <- read.table(file="/tmp/times.patch.sys")
> p.wall <- read.table(file="/tmp/times.patch.wall")
> t.user <- read.table(file="/tmp/times.trunk.user")
> t.sys <- read.table(file="/tmp/times.trunk.sys")
> t.wall <- read.table(file="/tmp/times.trunk.wall")
> t.test(t.user,p.user,alternative="greater")
Welch Two Sample t-test
data: t.user and p.user
t = 21.0552, df = 134.54, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.4654936 Inf
sample estimates:
mean of x mean of y
3.713000 3.207763
> 3.2077/3.713
[1] 0.8639106
> t.test(t.sys,p.sys,alternative="greater")
Welch Two Sample t-test
data: t.sys and p.sys
t = 1.3567, df = 137.286, p-value = 0.08856
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-0.003768599 Inf
sample estimates:
mean of x mean of y
0.980500 0.963421
> t.test(t.wall,p.wall,alternative="greater")
Welch Two Sample t-test
data: t.wall and p.wall
t = 6.5711, df = 118.318, p-value = 7.034e-10
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.3020628 Inf
sample estimates:
mean of x mean of y
7.667800 7.263816
{noformat}
To interpret the results for those who don't know R:
- The user time reduction is highly significant (p < 2.2e-16). With 95%
confidence it's reduced by at least 0.465s = 12.5% of the trunk mean.
- The sys time is not significantly reduced (p = 0.089 > 0.05). This is
consistent with our expectation that we're doing the same number of syscalls,
just avoiding buffer copies in user space.
- The wall clock time reduction is also highly significant (p = 7.0e-10). With
95% confidence it's reduced by at least 0.302s = 3.9% of the trunk mean.
I didn't include the R output, but analysis of the "CPU%" column of the "time"
results shows a similarly significant reduction in CPU utilization, with 95%
confidence of a reduction of at least 3.34%.
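For anyone who wants to check the arithmetic, the percentages quoted for user and wall time are the lower confidence bound divided by the trunk mean. A minimal sketch follows; the class and method names are made up for illustration, and the numbers are copied from the R output above.

```java
import java.util.Locale;

/** Hypothetical helper, for illustration only: reproduces the percentages quoted above. */
public class ReductionBounds {
    /** Relative reduction: seconds saved divided by the trunk (unpatched) mean. */
    public static double relative(double savedSeconds, double trunkMean) {
        return savedSeconds / trunkMean;
    }

    public static void main(String[] args) {
        // Sample means and one-sided 95% lower bounds from the t.test output.
        double userTrunk = 3.713000, userPatch = 3.207763, userLower = 0.4654936;
        double wallTrunk = 7.667800, wallPatch = 7.263816, wallLower = 0.3020628;

        System.out.printf(Locale.ROOT, "user: mean reduction %.1f%%, at least %.1f%%%n",
                100 * relative(userTrunk - userPatch, userTrunk),
                100 * relative(userLower, userTrunk));
        System.out.printf(Locale.ROOT, "wall: mean reduction %.1f%%, at least %.1f%%%n",
                100 * relative(wallTrunk - wallPatch, wallTrunk),
                100 * relative(wallLower, wallTrunk));
    }
}
```

The "at least" figures match the 12.5% and 3.9% quoted above. The mean reductions come out somewhat higher, which is expected since the bounds are conservative.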
The patch itself can probably be improved - just wanted to get early comments.
I did briefly test that HDFS still functions, but have not run through all the
unit tests. I also want to rerun the above benchmarks with io.file.buffer.size
tuned up to 64K or 128K as most people do in production.
> FSInputChecker and FSOutputSummer should allow better access to user buffer
> ---------------------------------------------------------------------------
>
> Key: HADOOP-3205
> URL: https://issues.apache.org/jira/browse/HADOOP-3205
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Attachments: hadoop-3205.txt
>
>
> Implementations of FSInputChecker and FSOutputSummer, such as DFS, do not have
> access to the full user buffer. At any time DFS can access only up to 512 bytes,
> even though the user usually reads with a much larger buffer (often controlled by
> io.file.buffer.size). This requires an implementation to double buffer data if
> it wants to read or write larger chunks of data from the underlying storage.
> We could separate the changes for FSInputChecker and FSOutputSummer into two
> separate jiras.