[ https://issues.apache.org/jira/browse/HADOOP-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-6166:
--------------------------------

    Attachment: Rplots-nehalem64.pdf
                Rplots-laptop.pdf
                Rplots-nehalem32.pdf

Looks like the benchmark has run long enough to get good data. Here are the 
benchmark results from TestPureJavaCrc32 on three test systems: nehalem32 is 
the same Nehalem box as before (3MB L2 cache) running a 32-bit JVM, nehalem64 
is that box with a 64-bit JVM, and "laptop" is my MacBook Pro (Core 2 Duo) 
running a 64-bit JVM.

Each PDF has several pages:
  - The first graph shows performance over the whole byte range tested. You'll 
definitely have to zoom in to be able to see anything here, and even then it's 
not that useful.
  - The remaining graphs show the different algorithms' performance at 
different sizes (the same breakdown as the tables people have been pasting 
into JIRA).

I ran the whole benchmark suite 50+ times to generate the error bars. Hopefully 
they'll serve as a good visual indicator of where the differences are actually 
statistically significant.
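
For reference, the measurement loop amounts to roughly the following. This is 
a minimal sketch, not the actual TestPureJavaCrc32 harness: java.util.zip.CRC32 
stands in for the implementations under test, and the class name and constants 
are made up.

{code:java}
import java.util.Random;
import java.util.zip.CRC32;

// Sketch of the repeated-run timing that produces the error bars: run the
// same measurement many times, then report mean and standard deviation.
// A real harness would also discard the first few runs as JIT warmup.
public class CrcTimingSketch {
  public static void main(String[] args) {
    final int RUNS = 50;       // matches the 50+ runs mentioned above
    final int SIZE = 512;      // one of the buffer sizes of interest
    final int ITERS = 100000;  // enough work per run to swamp timer noise

    byte[] buf = new byte[SIZE];
    new Random(0).nextBytes(buf);

    double[] mbPerSec = new double[RUNS];
    for (int r = 0; r < RUNS; r++) {
      CRC32 crc = new CRC32();
      long start = System.nanoTime();
      for (int i = 0; i < ITERS; i++) {
        crc.reset();
        crc.update(buf, 0, SIZE);
      }
      long elapsed = System.nanoTime() - start;
      mbPerSec[r] = ((double) SIZE * ITERS / (1 << 20)) / (elapsed / 1e9);
    }

    // Mean and sample standard deviation across runs; the error bars in
    // the PDFs come from exactly this kind of spread.
    double sum = 0;
    for (double v : mbPerSec) sum += v;
    double mean = sum / RUNS;
    double sq = 0;
    for (double v : mbPerSec) sq += (v - mean) * (v - mean);
    double stddev = Math.sqrt(sq / (RUNS - 1));
    System.out.printf("%d-byte buffers: %.1f +/- %.1f MB/s%n", SIZE, mean, stddev);
  }
}
{code}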

In summary, here's how I interpret the data:

- For the 4-byte case, PureJavaCrc32 wins by a strong margin on my laptop and 
on the 32-bit JVM. On the 64-bit JVM it's within 5-10% of the rest (very 
little difference).
- The 8-byte case is interesting: all of the 16_16* CRCs perform worse than 
the 8_8 CRCs (see the slicing-by-8 sketch after this list). On the 32-bit JVM 
it's especially obvious (nearly a factor of two).
- The 512-byte case (probably the most common for DFS): the implementations 
are pretty much neck and neck. The 8_8d implementation wins significantly on 
nehalem64, and 8_8b wins significantly on nehalem32. On my laptop they're all 
within the error bars, except for 16_16, which is significantly worse.
- The 16MB case shows the same pattern as the 512-byte case, just more 
pronounced: 8_8d wins on 64-bit and 8_8b on 32-bit, both by about 10%.
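
For anyone who hasn't been following the patches: my reading of the variant 
names is that 8_8 means eight 256-entry (8-bit-indexed) lookup tables 
consuming 8 input bytes per loop iteration (classic "slicing-by-8"), while the 
16_16* variants use wider, 16-bit-indexed tables. Here's a minimal sketch of 
the slicing-by-8 technique with the standard reflected CRC-32 polynomial; it's 
illustrative only, not the actual 8_8b/8_8d code from the patches:

{code:java}
// Generic slicing-by-8 CRC-32 sketch (polynomial 0xEDB88320). Callers
// start with crc = 0xFFFFFFFF and invert the result, as usual for CRC-32.
public class SlicingBy8Crc32 {
  private static final int[][] T = new int[8][256];
  static {
    // T[0] is the classic byte-at-a-time reflected CRC-32 table.
    for (int v = 0; v < 256; v++) {
      int c = v;
      for (int j = 0; j < 8; j++) {
        c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
      }
      T[0][v] = c;
    }
    // T[k][v] = CRC of byte v followed by k zero bytes.
    for (int k = 1; k < 8; k++) {
      for (int v = 0; v < 256; v++) {
        T[k][v] = (T[k - 1][v] >>> 8) ^ T[0][T[k - 1][v] & 0xff];
      }
    }
  }

  public static int update(int crc, byte[] b, int off, int len) {
    // Eight table lookups per 8 input bytes. The lookups are independent
    // of each other, so a superscalar core can overlap them.
    while (len >= 8) {
      int one = (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8)
          | ((b[off + 2] & 0xff) << 16) | ((b[off + 3] & 0xff) << 24);
      one ^= crc;
      int two = (b[off + 4] & 0xff) | ((b[off + 5] & 0xff) << 8)
          | ((b[off + 6] & 0xff) << 16) | ((b[off + 7] & 0xff) << 24);
      crc = T[7][one & 0xff]          ^ T[6][(one >>> 8) & 0xff]
          ^ T[5][(one >>> 16) & 0xff] ^ T[4][one >>> 24]
          ^ T[3][two & 0xff]          ^ T[2][(two >>> 8) & 0xff]
          ^ T[1][(two >>> 16) & 0xff] ^ T[0][two >>> 24];
      off += 8;
      len -= 8;
    }
    // Trailing bytes one at a time.
    while (len-- > 0) {
      crc = (crc >>> 8) ^ T[0][(crc ^ b[off++]) & 0xff];
    }
    return crc;
  }
}
{code}

My guess is that the b/d suffixes differ in loop arrangement rather than table 
layout, and that the 16-bit-indexed variants lose on small inputs because their 
much larger tables don't stay warm in cache.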

So, I think the next step here is to profile a couple of MR applications to see 
what checksum buffer sizes are most common.
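
One cheap way to do that: wrap whatever Checksum the code path uses and 
histogram the lengths passed to update(). A sketch, assuming nothing beyond 
java.util.zip.Checksum; the class name ChecksumSizeHistogram is hypothetical, 
not something from the patches:

{code:java}
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.zip.Checksum;

// Delegating Checksum that counts update() lengths in power-of-two
// buckets, so a test job can dump the distribution of sizes it checksums.
public class ChecksumSizeHistogram implements Checksum {
  private final Checksum delegate;
  // bucket i counts calls with len in [2^i, 2^(i+1))
  private final AtomicLongArray buckets = new AtomicLongArray(32);

  public ChecksumSizeHistogram(Checksum delegate) {
    this.delegate = delegate;
  }

  public void update(int b) {
    buckets.incrementAndGet(0); // single-byte update
    delegate.update(b);
  }

  public void update(byte[] b, int off, int len) {
    if (len > 0) {
      buckets.incrementAndGet(31 - Integer.numberOfLeadingZeros(len));
    }
    delegate.update(b, off, len);
  }

  public long getValue() { return delegate.getValue(); }

  public void reset() { delegate.reset(); }

  public void dump() {
    for (int i = 0; i < 32; i++) {
      long n = buckets.get(i);
      if (n > 0) {
        System.out.printf("[%d, %d): %d calls%n", 1L << i, 1L << (i + 1), n);
      }
    }
  }
}
{code}

Wrapping e.g. new ChecksumSizeHistogram(new CRC32()) wherever the checksum is 
created, and calling dump() at teardown, would give the size distribution 
directly.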

My personal opinion is that we should target the 64-bit Nehalem architecture 
and the 128-byte size range. This would point to the 8_8d implementation as the 
winner.

> Improve PureJavaCrc32
> ---------------------
>
>                 Key: HADOOP-6166
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6166
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6166_20090722.patch, c6166_20090722_benchmark_32VM.txt, 
> c6166_20090722_benchmark_64VM.txt, c6166_20090727.patch, 
> c6166_20090728.patch, c6166_20090810.patch, c6166_20090811.patch, graph.r, 
> graph.r, Rplots-laptop.pdf, Rplots-nehalem32.pdf, Rplots-nehalem64.pdf, 
> Rplots.pdf, Rplots.pdf, Rplots.pdf
>
>
> Got some ideas to improve CRC32 calculation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
