[ 
https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HADOOP-3205:
--------------------------------

    Attachment: hadoop-3205.txt

Here's a patch that implements the FSInputChecker side of this ticket.

Benchmark results are promising. I put a 700MB file in /dev/shm with its 
associated checksum and then timed "hadoop fs -cat /dev/shm/bigfile" 100 times 
with the patch and without the patch. Here is R output from the analysis of 
these times:

{noformat}
> p.user <- read.table(file="/tmp/times.patch.user")
> p.sys <- read.table(file="/tmp/times.patch.sys")
> p.wall <- read.table(file="/tmp/times.patch.wall")
> t.user <- read.table(file="/tmp/times.trunk.user")
> t.sys <- read.table(file="/tmp/times.trunk.sys")
> t.wall <- read.table(file="/tmp/times.trunk.wall")
> t.test(t.user,p.user,alternative="greater")

        Welch Two Sample t-test

data:  t.user and p.user 
t = 21.0552, df = 134.54, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 0.4654936       Inf 
sample estimates:
mean of x mean of y 
 3.713000  3.207763 

> 3.2077/3.713
[1] 0.8639106
> t.test(t.sys,p.sys,alternative="greater")

        Welch Two Sample t-test

data:  t.sys and p.sys 
t = 1.3567, df = 137.286, p-value = 0.08856
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 -0.003768599          Inf 
sample estimates:
mean of x mean of y 
 0.980500  0.963421 

> t.test(t.wall,p.wall,alternative="greater")

        Welch Two Sample t-test

data:  t.wall and p.wall 
t = 6.5711, df = 118.318, p-value = 7.034e-10
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 0.3020628       Inf 
sample estimates:
mean of x mean of y 
 7.667800  7.263816
{noformat}

To interpret the results for those who don't know R:
- The user time is significantly reduced (p < 2.2e-16). With 95% confidence 
it's reduced by at least 0.465s, or 12.5% of the trunk mean.
- The sys time is not significantly reduced (p = 0.089 > 0.05). This is 
consistent with our expectation that we're doing the same number of syscalls, 
just avoiding buffer copies in user space.
- Wall clock time is significantly reduced (p = 7.0e-10). With 95% confidence 
it's reduced by at least 0.302s, or 3.9% of the trunk mean.

I didn't include the R output, but analysis of the "CPU%" column of the "time" 
results shows a similarly significant reduction in CPU percent utilization, 
with 95% confidence of a reduction of at least 3.34%.

The patch itself can probably be improved - just wanted to get early comments. 
I did briefly test that HDFS still functions, but have not run through all the 
unit tests. I also want to rerun the above benchmarks with io.file.buffer.size 
tuned up to 64K or 128K as most people do in production.
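
For anyone not familiar with the double-buffering problem, here is a rough 
Java sketch of the kind of user-space copy the benchmark above is about. It is 
not the attached patch and not the real FSInputChecker API: the class 
UserBufferSketch, both method names, and the per-chunk use of 
java.util.zip.CRC32 are assumptions made purely for illustration, and the 
comparison against stored checksums is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;

/** Hypothetical illustration only; not Hadoop code. */
final class UserBufferSketch {

    /** Reads one chunk at a time into an internal buffer, then copies it to the caller. */
    static int readDoubleBuffered(InputStream in, byte[] userBuf, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        byte[] chunk = new byte[chunkSize];          // intermediate buffer
        CRC32 crc = new CRC32();
        int total = 0;
        try {
            while (total < userBuf.length) {
                int want = Math.min(chunkSize, userBuf.length - total);
                int n = in.read(chunk, 0, want);
                if (n < 0) break;                    // end of stream
                crc.reset();
                crc.update(chunk, 0, n);             // checksum the chunk (verification omitted)
                System.arraycopy(chunk, 0, userBuf, total, n);  // the extra user-space copy
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Reads each chunk straight into the caller's buffer and checksums it in place. */
    static int readDirect(InputStream in, byte[] userBuf, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        CRC32 crc = new CRC32();
        int total = 0;
        try {
            while (total < userBuf.length) {
                int want = Math.min(chunkSize, userBuf.length - total);
                int n = in.read(userBuf, total, want);
                if (n < 0) break;                    // end of stream
                crc.reset();
                crc.update(userBuf, total, n);       // checksum in place, no intermediate copy
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Both methods fill the caller's buffer with the same bytes; the second one just 
skips the System.arraycopy per chunk, which is the sort of saving that would 
show up in user time rather than sys time.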

> FSInputChecker and FSOutputSummer should allow better access to user buffer
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-3205
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3205
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: hadoop-3205.txt
>
>
> Implementations of FSInputChecker and FSOutputSummer like DFS do not have 
> access to full user buffer. At any time DFS can access only up to 512 bytes 
> even though user usually reads with a much larger buffer (often controlled by 
> io.file.buffer.size). This requires implementations to double buffer data if 
> an implementation wants to read or write larger chunks of data from 
> underlying storage.
> We could separate changes for FSInputChecker and FSOutputSummer into two 
> separate jiras.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
