How are you profiling? I don't trust most profilers.
Have you tried, e.g., disabling checksums and seeing how much
performance is actually gained? For the local filesystem, you can
easily disable checksums by binding file: URIs to RawLocalFileSystem in
your configuration.
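For example, a minimal sketch of what I mean, assuming the usual
per-scheme property fs.file.impl (the same value can go in
hadoop-site.xml instead of being set programmatically):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class NoChecksumLocalFs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Bind the file: scheme to the raw (checksum-free) local filesystem.
      conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
      FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
      // Writes through this FileSystem no longer produce .crc side files.
      fs.create(new Path("/tmp/no-crc-test")).close();
    }
  }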
Doug
Bryan Duxbury wrote:
Hey all,
I've been profiling our map/reduce applications quite a bit over the
last few weeks to try to get some performance improvements in our jobs.
I noticed an interesting bottleneck in Hadoop itself that I thought I
should bring up.
FSDataOutputStream appears to create a CRC of the data being written via
FSOutputSummer.write1. It uses the built-in Java CRC32 implementation to
do so. However, out of a 41-second reducer main thread, this CRC call is
taking up around 13 seconds, or about 32%. This appears to dwarf the
actual writing time (FSOutputSummer.flushBuffer), which takes only 1.9s
(5%). This seems like an incredibly large amount of overhead to pay.
To my surprise, there's already a faster checksum implementation in the
Java standard library called Adler32, which is described as "almost as
reliable as a CRC-32 but can be computed much faster". This sounds very
attractive, indeed. Some quick tests indicate that Adler32 is about 3x
as fast.
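For reference, a rough sketch of the kind of quick test I mean (the
buffer size and iteration count here are arbitrary, not my exact test):

  import java.util.Random;
  import java.util.zip.Adler32;
  import java.util.zip.CRC32;
  import java.util.zip.Checksum;

  public class ChecksumBench {
    public static void main(String[] args) {
      byte[] buf = new byte[64 * 1024];            // arbitrary chunk size
      new Random(42).nextBytes(buf);
      for (String name : new String[] {"CRC32", "Adler32"}) {
        Checksum sum = name.equals("CRC32") ? new CRC32() : new Adler32();
        long start = System.currentTimeMillis();
        for (int i = 0; i < 20000; i++) {          // ~1.3 GB total
          sum.update(buf, 0, buf.length);
        }
        long ms = System.currentTimeMillis() - start;
        System.out.println(name + ": " + ms + " ms (value=" + sum.getValue() + ")");
      }
    }
  }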
Is there any reason why CRC32 was chosen, or why Adler32 wouldn't be an
acceptable checksum? I understand that Adler32 is weak for small messages
(small as in hundreds of bytes), but since this is behind a buffered
writer, the messages should all be thousands of bytes to begin with.
Worst case, I guess we could select the checksum algorithm based on the
size of the message, using CRC32 for small messages and Adler32 for
larger ones.
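Roughly what I have in mind, as a hypothetical helper (not existing
Hadoop code, and the threshold is made up):

  import java.util.zip.Adler32;
  import java.util.zip.CRC32;
  import java.util.zip.Checksum;

  public class ChecksumChooser {
    private static final int ADLER_MIN_BYTES = 1024;   // made-up cutoff

    // Adler-32 is weak on very short inputs, so fall back to CRC-32 there.
    public static Checksum forChunkSize(int bytesPerChecksum) {
      return bytesPerChecksum < ADLER_MIN_BYTES ? new CRC32() : new Adler32();
    }
  }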
-Bryan