[ https://issues.apache.org/jira/browse/HADOOP-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731721#action_12731721 ]

Scott Carey commented on HADOOP-6148:
-------------------------------------

Whatever Timer class is being used for measurement relies on the system 
millisecond clock, which has only ~15 ms resolution, so test runs need to be 
long to be accurate.  To get more accurate results, use System.nanoTime().
Also, I have found that a couple of 'warmup' iterations of a test make the 
results much more consistent, since they keep the JIT from interfering.
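
For illustration, a measurement loop along these lines avoids both problems 
and normalizes to MB/sec directly.  This is a minimal sketch, not the attached 
TestCrc32Performance harness; the warmup count, iteration count, and buffer 
size are arbitrary:

{code}
import java.util.Random;
import java.util.zip.CRC32;

public class NanoTimeBench {
  public static void main(String[] args) {
    byte[] buf = new byte[1 << 20];     // 1 MB of random input
    new Random(0).nextBytes(buf);
    for (int i = 0; i < 3; i++) {       // warmup passes so the JIT settles down
      run(buf);
    }
    System.out.printf("%.3f MB/sec%n", run(buf));
  }

  static double run(byte[] buf) {
    final int iters = 100;
    CRC32 crc = new CRC32();
    long start = System.nanoTime();     // nanosecond clock, not the ~15 ms millisecond timer
    for (int i = 0; i < iters; i++) {
      crc.update(buf, 0, buf.length);
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    return (double) iters * buf.length / (1 << 20) / seconds;  // normalize to MB/sec
  }
}
{code}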

Using the benchmark I used before (previously attached, 
TestCrc32Performance.java), the new version is consistently 10% slower than 
the previous one on my machine (Java 6, Mac OS X, Core2 Duo processor, 64-bit 
JVM).  On Sun JRE 6.0_u14 on Linux (64-bit) with different CPUs, results vary.  
I'll dig into those details below.

Results should be normalized to a metric we can compare -- we have been using 
MB/sec so far.  Additionally, the JVM and environment used are critical.  
Java 1.5 behaves VERY differently, and I would expect changes like this to 
behave differently there (as well as on OpenJDK, IBM, or JRockit).

There are two changes here -- the loop format and termination, and how the 
shifts and masks are packed.

Adding or removing the final declarations makes no difference in my testing -- 
the compiler easily identifies whether variables change or not.

These two changes, taken alone or together, have varied results.  This is 
probably because loop unrolling in the JIT behaves differently.  Generally, 
the loop change helped the least: it only avoids a decrement in the loop 
condition, which is essentially free on some processors, and it has higher 
start-up cost and more variables.
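
To make the two changes concrete, here is roughly what they look like side by 
side.  The exact variants are in the attached patches; the 'old' form and the 
loop-change comment below are my reconstruction, so treat the details as 
illustrative only:

{code}
// Illustrative reconstruction of the inner-code variants being compared (the
// exact code is in the attached patches; the lookup tables are left empty
// here, so this is a shape comparison only, not a working CRC).
class Variants {
  private final int[] T1 = new int[256], T2 = new int[256],
                      T3 = new int[256], T4 = new int[256];
  private int crc = 0xffffffff;

  // Old inner code: shift crc explicitly for each of the four table lookups.
  void updateOldInner(byte[] b, int off, int len) {
    while (len > 3) {
      crc = T4[(crc ^ b[off]) & 0xff]
          ^ T3[((crc >>> 8) ^ b[off + 1]) & 0xff]
          ^ T2[((crc >>> 16) ^ b[off + 2]) & 0xff]
          ^ T1[((crc >>> 24) ^ b[off + 3]) & 0xff];
      off += 4;
      len -= 4;
    }
  }

  // New inner code: fold each shift into the byte read via crc >>>= 8.
  void updateNewInner(byte[] b, int off, int len) {
    while (len > 3) {
      int c0 = crc ^ b[off++];
      int c1 = (crc >>>= 8) ^ b[off++];
      int c2 = (crc >>>= 8) ^ b[off++];
      int c3 = (crc >>>= 8) ^ b[off++];
      crc = T4[c0 & 0xff] ^ T3[c1 & 0xff] ^ T2[c2 & 0xff] ^ T1[c3 & 0xff];
      len -= 4;
    }
  }

  // The loop change, by contrast, replaces the count-down on len with an index
  // compared against a precomputed bound, avoiding the decrement at the cost
  // of extra setup variables -- again a reconstruction, see the patches.
}
{code}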

Here are results on my laptop, followed by two different servers.

Note that the first two results in any test are always lower than the rest, 
regardless of the order in which I run the tests, so consider the 'size 1' 
scores suspect.

Here are my results with the new version posted, on my laptop Core2 Duo -- 
2.5 GHz, 4 GB 667 MHz DDR2 SDRAM, OS X 10.5.7, Java 6 (Apple):

$ java -Xmx512m -cp testall.jar org.apache.hadoop.util.TestCrc32Performance
||num bytes||NewLoopOnly MB/sec||NewInnerOnly MB/sec||NewPureJava MB/sec||PureJava MB/sec||Native MB/sec||
| 1 | 85.195 | 94.160 | 121.054 | 117.785 | 6.580 |
| 2 | 123.629 | 133.855 | 131.112 | 164.309 | 12.354 |
| 4 | 226.677 | 220.528 | 194.145 | 287.520 | 24.309 |
| 8 | 240.169 | 262.491 | 253.665 | 283.975 | 45.353 |
| 16 | 343.329 | 364.299 | 354.749 | 383.966 | 77.207 |
| 32 | 441.347 | 445.800 | 433.381 | 462.508 | 122.829 |
| 64 | 522.195 | 522.210 | 502.894 | 528.559 | 188.184 |
| 128 | 570.476 | 551.149 | 541.540 | 555.572 | 194.542 |
| 256 | 596.147 | 577.042 | 558.884 | 591.713 | 289.183 |
| 512 | 601.956 | 583.593 | 561.882 | 593.323 | 315.714 |
| 1024 | 623.217 | 592.406 | 577.292 | 603.304 | 332.319 |
| 2048 | 623.979 | 594.163 | 581.419 | 606.365 | 341.448 |
| 4096 | 624.345 | 596.289 | 584.365 | 610.018 | 344.685 |
| 8192 | 626.711 | 593.323 | 585.424 | 607.542 | 347.891 |
| 16384 | 625.938 | 599.650 | 584.414 | 607.903 | 350.221 |
| 32768 | 623.995 | 583.516 | 577.771 | 609.930 | 349.754 |
| 65536 | 623.906 | 594.321 | 578.602 | 610.338 | 347.915 |
| 131072 | 624.308 | 595.950 | 577.024 | 610.308 | 350.647 |
| 262144 | 629.946 | 590.831 | 577.453 | 610.285 | 351.603 |
| 524288 | 623.757 | 597.578 | 575.428 | 610.399 | 349.063 |
| 1048576 | 624.043 | 596.817 | 577.521 | 610.798 | 352.213 |
| 2097152 | 627.303 | 591.817 | 573.852 | 610.981 | 352.406 |
| 4194304 | 623.168 | 593.048 | 570.512 | 605.679 | 347.898 |
| 8388608 | 609.118 | 587.879 | 562.892 | 600.384 | 344.380 |
| 16777216 | 610.116 | 585.480 | 555.988 | 601.001 | 348.796 |

For the above, the new loop helps, the new inner code hurts, and the 
combination is worst of all.  For small checksum sizes, all variants are 
slower.

For a Linux server (dual quad-core Xeon E5335 @ 2.00 GHz) on Sun JDK 1.6.0_u14, 
the results are different: the loop change does not help, changing the inner 
code alone helps the most, combining the two falls somewhere in between, and 
for very small sizes everything is slower:

$ java -Xmx512m -XX:+UseParallelOldGC -XX:+UseCompressedOops -cp testall.jar org.apache.hadoop.util.TestCrc32Performance
||num bytes||NewLoopOnly MB/sec||NewInnerOnly MB/sec||NewPureJava MB/sec||PureJava MB/sec||Native MB/sec||
| 1 | 66.281 | 66.264 | 85.825 | 93.888 | 7.331 |
| 2 | 93.812 | 94.895 | 92.638 | 129.917 | 13.931 |
| 4 | 155.431 | 144.819 | 161.540 | 178.275 | 26.586 |
| 8 | 174.531 | 185.222 | 206.577 | 185.253 | 47.892 |
| 16 | 244.490 | 275.026 | 292.237 | 256.664 | 81.922 |
| 32 | 306.768 | 351.271 | 345.710 | 313.523 | 127.560 |
| 64 | 350.763 | 407.539 | 382.370 | 352.997 | 175.000 |
| 128 | 377.402 | 442.239 | 406.650 | 376.721 | 218.138 |
| 256 | 392.901 | 461.884 | 420.526 | 390.215 | 247.586 |
| 512 | 397.682 | 467.375 | 425.522 | 393.319 | 264.732 |
| 1024 | 404.145 | 474.192 | 430.791 | 391.678 | 272.883 |
| 2048 | 407.673 | 478.888 | 433.719 | 400.236 | 277.638 |
| 4096 | 409.028 | 480.907 | 435.645 | 401.338 | 280.214 |
| 8192 | 409.890 | 482.769 | 436.209 | 402.037 | 281.782 |
| 16384 | 409.565 | 482.726 | 436.340 | 401.767 | 282.409 |
| 32768 | 407.943 | 481.176 | 436.373 | 399.955 | 282.455 |
| 65536 | 407.933 | 481.746 | 435.970 | 400.228 | 282.761 |
| 131072 | 408.003 | 481.723 | 436.516 | 399.973 | 282.749 |
| 262144 | 407.412 | 481.357 | 436.067 | 400.962 | 283.193 |
| 524288 | 408.077 | 481.335 | 436.416 | 401.280 | 283.137 |
| 1048576 | 408.016 | 481.625 | 436.086 | 402.039 | 283.308 |
| 2097152 | 407.397 | 481.386 | 436.131 | 401.394 | 283.353 |
| 4194304 | 406.609 | 479.130 | 434.960 | 400.632 | 282.376 |
| 8388608 | 403.235 | 475.130 | 430.770 | 397.797 | 280.904 |
| 16777216 | 402.891 | 474.464 | 430.427 | 397.324 | 280.951 |


Lastly, I have access to a dual quad-core Nehalem system (Xeon X5550 @ 
2.67 GHz).  The trend is similar to the other Linux server.
||num bytes||NewLoopOnly MB/sec||NewInnerOnly MB/sec||NewPureJava MB/sec||PureJava MB/sec||Native MB/sec||
| 1 | 100.809 | 105.911 | 124.574 | 168.577 | 17.863 |
| 2 | 144.671 | 157.576 | 150.993 | 203.188 | 33.722 |
| 4 | 230.455 | 226.380 | 250.517 | 276.617 | 64.370 |
| 8 | 263.961 | 288.727 | 308.807 | 292.203 | 103.247 |
| 16 | 345.860 | 419.419 | 454.893 | 394.811 | 158.828 |
| 32 | 437.694 | 536.016 | 534.576 | 470.360 | 226.527 |
| 64 | 503.768 | 611.457 | 579.010 | 470.782 | 281.814 |
| 128 | 537.179 | 650.089 | 614.528 | 522.099 | 317.293 |
| 256 | 557.598 | 672.534 | 630.303 | 558.545 | 344.511 |
| 512 | 561.642 | 677.876 | 611.557 | 463.531 | 345.938 |
| 1024 | 565.044 | 690.742 | 615.121 | 558.021 | 367.634 |
| 2048 | 583.641 | 714.832 | 634.747 | 602.027 | 372.008 |
| 4096 | 594.691 | 717.332 | 643.951 | 598.466 | 369.619 |
| 8192 | 562.813 | 639.987 | 596.214 | 560.489 | 366.227 |
| 16384 | 540.490 | 682.933 | 481.701 | 571.237 | 373.531 |
| 32768 | 520.446 | 636.502 | 563.346 | 539.832 | 333.878 |
| 65536 | 498.553 | 617.019 | 556.558 | 528.404 | 343.788 |
| 131072 | 526.779 | 625.905 | 585.023 | 520.933 | 333.155 |
| 262144 | 535.428 | 617.910 | 568.184 | 530.084 | 337.989 |
| 524288 | 563.442 | 637.368 | 623.922 | 578.144 | 379.028 |
| 1048576 | 578.139 | 709.259 | 632.558 | 596.539 | 374.711 |
| 2097152 | 537.883 | 662.951 | 634.672 | 583.292 | 363.072 |
| 4194304 | 551.819 | 677.231 | 632.805 | 580.628 | 372.185 |
| 8388608 | 581.055 | 689.236 | 624.939 | 584.958 | 358.719 |
| 16777216 | 563.078 | 675.309 | 566.319 | 579.324 | 365.306 |

So it looks like the winner is to change the inner part of the loop only.  
Although this hurts a bit on Apple's VM, that isn't a production VM.  Unless 
there is data for some other VM that suggests this is a poor choice, it looks 
best to me.  It is slightly slower for sizes below about 8 bytes, however.  
The code is:

{code}
  public void update(byte[] b, int off, int len) {
    // Process four bytes per iteration, combining four table lookups.
    while (len > 3) {
      int c0 = crc ^ b[off++];
      int c1 = (crc >>>= 8) ^ b[off++];
      int c2 = (crc >>>= 8) ^ b[off++];
      int c3 = (crc >>>= 8) ^ b[off++];
      crc = T4[c0 & 0xff] ^ T3[c1 & 0xff] ^ T2[c2 & 0xff] ^ T1[c3 & 0xff];
      len -= 4;
    }
    // Handle the remaining 0-3 bytes one at a time.
    while (len > 0) {
      crc = (crc >>> 8) ^ T1[(crc ^ b[off++]) & 0xff];
      len--;
    }
  }
{code}
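
As a sanity check, the result should match the built-in CRC32 on arbitrary 
input.  A quick sketch (it assumes the attached PureJavaCrc32 class is on the 
classpath and implements java.util.zip.Checksum):

{code}
import java.util.Random;
import java.util.zip.CRC32;

public class Crc32Match {
  public static void main(String[] args) {
    byte[] data = new byte[65536];
    new Random(42).nextBytes(data);           // arbitrary input

    CRC32 jdk = new CRC32();                  // the built-in JNI-backed implementation
    jdk.update(data, 0, data.length);

    PureJavaCrc32 pure = new PureJavaCrc32(); // attached implementation
    pure.update(data, 0, data.length);

    System.out.println(jdk.getValue() == pure.getValue() ? "match" : "MISMATCH");
  }
}
{code}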


> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-6148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6148
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: benchmarks20090714.txt, benchmarks20090715.txt, 
> crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, 
> hadoop-5598.txt, hadoop-5598.txt, hdfs-297.txt, PureJavaCrc32.java, 
> PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, 
> TestCrc32Performance.java, TestCrc32Performance.java, 
> TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a 
> long time in crc calculation. In particular, it was spending 5 seconds in crc 
> calculation out of a total of 6 for the write. I suspect that it is the 
> java-jni border that is causing us grief.
