I am profiling with YourKit on randomly chosen reducers. I'm also running
on HDFS, so I don't know how one would go about disabling CRCs there.
-Bryan
On Oct 6, 2008, at 4:35 PM, Doug Cutting wrote:
How are you profiling? I don't trust most profilers.
Have you tried, e.g., disabling checksums and seeing how much
performance is actually gained? For the local filesystem, you can
easily disable checksums by binding file: URIs to
RawLocalFileSystem in your configuration.
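For example (an untested sketch; fs.file.impl is the standard
fs.<scheme>.impl key, and the default LocalFileSystem is the
ChecksumFileSystem wrapper you'd be bypassing):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RawLocalExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Bind the file: scheme to the raw (non-checksumming) local
        // filesystem instead of the checksumming default.
        conf.set("fs.file.impl",
                 "org.apache.hadoop.fs.RawLocalFileSystem");
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(fs.getClass().getName());
      }
    }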
Doug
Bryan Duxbury wrote:
Hey all,
I've been profiling our map/reduce applications quite a bit over
the last few weeks, trying to get some performance improvements in
our jobs. I noticed an interesting bottleneck in Hadoop itself that
I thought I should bring up.
FSDataOutputStream appears to compute a CRC of the data being
written, via FSOutputSummer.write1. It uses the built-in Java CRC32
implementation to do so. However, out of a 41-second reducer main
thread, this CRC call takes up around 13 seconds, or about 32%.
That dwarfs the actual writing time (FSOutputSummer.flushBuffer),
which takes only 1.9s (5%). This seems like an incredibly large
amount of overhead to pay.
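To show the shape of the hot path, here's a simplified sketch of the
pattern (my own illustration, not the actual FSOutputSummer code):
every buffer written gets folded into a running CRC32 before it
reaches the underlying stream, so the checksum update sits directly
on the write path.

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;

    class SummingOutputStream extends FilterOutputStream {
      private final CRC32 sum = new CRC32();

      SummingOutputStream(OutputStream out) { super(out); }

      @Override
      public void write(int b) throws IOException {
        sum.update(b);
        out.write(b);
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        sum.update(b, off, len); // the CRC32 call that dominates the profile
        out.write(b, off, len);
      }

      long checksum() { return sum.getValue(); }
    }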
To my surprise, there's already a faster checksum implementation in
the Java standard library: Adler32, whose javadoc describes it as
"almost as reliable as a CRC-32 but can be computed much faster".
That sounds very attractive indeed. Some quick tests indicate that
Adler32 is about 3x as fast.
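For reference, a microbenchmark along these lines shows the gap (a
sketch only: the 64KB buffer and iteration count are arbitrary, and
a warm-up pass would make the JIT comparison fairer):

    import java.util.Random;
    import java.util.zip.Adler32;
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public class ChecksumBench {
      public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024]; // one buffered-write-sized chunk
        new Random(0).nextBytes(buf);
        for (Checksum sum : new Checksum[] { new CRC32(), new Adler32() }) {
          long start = System.nanoTime();
          for (int i = 0; i < 10000; i++) { // ~640 MB per algorithm
            sum.reset();
            sum.update(buf, 0, buf.length);
          }
          long ms = (System.nanoTime() - start) / 1000000;
          System.out.println(sum.getClass().getSimpleName() + ": " + ms + " ms");
        }
      }
    }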
Is there any reason why CRC32 was chosen, or why Adler32 wouldn't
be an acceptable checksum? I understand that Adler32 is weak for
small messages (small as in hundreds of bytes), but since this sits
behind a buffered writer, the messages should all be thousands of
bytes to begin with. Worst case, I guess we could select the
checksum algorithm based on the size of the message, using CRC32
for small messages and Adler32 for bigger ones.
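Since both classes implement java.util.zip.Checksum, that fallback
would be nearly a one-liner (a sketch; the 512-byte cutoff is a
made-up placeholder, not a measured threshold):

    import java.util.zip.Adler32;
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    class ChecksumChooser {
      // Use CRC32 for small chunks, Adler32 for large ones.
      // 512 bytes is a placeholder, not a measured threshold.
      static Checksum forChunk(int chunkLength) {
        return chunkLength < 512 ? new CRC32() : new Adler32();
      }
    }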
-Bryan