[ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720996#action_12720996 ]
Todd Lipcon commented on HADOOP-5598:
-------------------------------------

Scott: I just tried your version and was unable to get the same performance improvements. I think we've established that the pure Java version definitely wins on small blocks. For large blocks, I'm seeing the following on my laptop (64-bit, with a 64-bit JRE):

My most recent non-evil pure Java: 250M/sec
Scott's patch that unrolls the loop: 260-280M/sec
Sun Java 1.6 update 14: 333M/sec
OpenJDK 1.6: 795M/sec

The OpenJDK implementation simply wraps zlib's crc32 routine, which must be highly optimized. Given that we already have a JNI library for native compression using zlib, I'd like to simply add a stub to libhadoop that wraps zlib's crc32. That should give us the same ~800M/sec throughput for large blocks. Since we implement the stub ourselves, we also have the ability to switch to pure Java for small sizes and get the 20x speedup there, with no adversarial workloads causing bad performance.

On systems where the native code isn't available, we can simply use the pure Java implementation for all sizes, since at worst it's only slightly slower than java.util.zip.CRC32 and at best it's 30x faster. I imagine that most production systems are using libhadoop, or at least could easily get it deployed if it were shown to have significant performance benefits.

I'll upload a patch later this evening for this.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.
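(Editor's note: for readers following the discussion above, the selection logic Todd describes -- native zlib-backed CRC for large blocks, pure Java for small blocks, and pure Java everywhere when libhadoop isn't loaded -- could look roughly like the sketch below. The class name, threshold value, and factory method are hypothetical and are not taken from the actual HADOOP-5598 patch; java.util.zip.CRC32 stands in for the proposed libhadoop JNI stub around zlib's crc32.)

    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public class Crc32Factory {

      // Hypothetical crossover point; the real value would have to come
      // from benchmarks like TestCrc32Performance.
      private static final int NATIVE_THRESHOLD = 4096;

      /**
       * Pick a CRC32 implementation for a stream whose checksum chunks
       * are bytesPerChecksum bytes long.
       */
      public static Checksum newCrc32(int bytesPerChecksum, boolean nativeAvailable) {
        if (nativeAvailable && bytesPerChecksum >= NATIVE_THRESHOLD) {
          // Stand-in for a JNI stub in libhadoop that calls zlib's crc32();
          // OpenJDK's java.util.zip.CRC32 already delegates to zlib.
          return new CRC32();
        }
        // PureJavaCrc32 is the class attached to this issue, assumed here
        // to implement java.util.zip.Checksum.
        return new PureJavaCrc32();
      }
    }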