I am profiling with YourKit on randomly chosen reducers. I'm also running
on HDFS, so I don't know how one would go about disabling CRCs there.
-Bryan
On Oct 6, 2008, at 4:35 PM, Doug Cutting wrote:
How are you profiling? I don't trust most profilers.
Have you tried, e.g., disabling checksums and seeing how much
performance is actually gained? For the local filesystem, you can
easily disable checksums by binding file: URIs to
RawLocalFileSystem in your configuration.
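For example (an untested sketch; fs.file.impl is the standard
fs.<scheme>.impl key, and the default LocalFileSystem is the
ChecksumFileSystem wrapper you'd be bypassing):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RawLocalExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Bind the file: scheme to the raw (non-checksumming) local
        // filesystem instead of the checksumming default.
        conf.set("fs.file.impl",
                 "org.apache.hadoop.fs.RawLocalFileSystem");
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(fs.getClass().getName());
      }
    }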
Doug
Bryan Duxbury wrote:
Hey all,
I've been profiling our map/reduce applications quite a bit over
the last few weeks, trying to get some performance improvements in
our jobs. I noticed an interesting bottleneck in Hadoop itself that
I thought I should bring up.
FSDataOutputStream appears to compute a CRC of the data being
written, via FSOutputSummer.write1. It uses the built-in Java CRC32
implementation to do so. However, out of a 41-second reducer main
thread, this CRC call takes up around 13 seconds, or about 32%.
That dwarfs the actual writing time (FSOutputSummer.flushBuffer),
which takes only 1.9s (5%). This seems like an incredibly large
amount of overhead to pay.
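To show the shape of the hot path, here's a simplified sketch of the
pattern (my own illustration, not the actual FSOutputSummer code):
every buffer written gets folded into a running CRC32 before it
reaches the underlying stream, so the checksum update sits directly
on the write path.

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;

    class SummingOutputStream extends FilterOutputStream {
      private final CRC32 sum = new CRC32();

      SummingOutputStream(OutputStream out) { super(out); }

      @Override
      public void write(int b) throws IOException {
        sum.update(b);
        out.write(b);
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        sum.update(b, off, len); // the CRC32 call that dominates the profile
        out.write(b, off, len);
      }

      long checksum() { return sum.getValue(); }
    }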
To my surprise, there's already a faster checksum implementation in
the Java standard library: Adler32, whose javadoc describes it as
"almost as reliable as a CRC-32 but can be computed much faster".
That sounds very attractive indeed. Some quick tests indicate that
Adler32 is about 3x as fast.
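For reference, a microbenchmark along these lines shows the gap (a
sketch only: the 64KB buffer and iteration count are arbitrary, and
a warm-up pass would make the JIT comparison fairer):

    import java.util.Random;
    import java.util.zip.Adler32;
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public class ChecksumBench {
      public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024]; // one buffered-write-sized chunk
        new Random(0).nextBytes(buf);
        for (Checksum sum : new Checksum[] { new CRC32(), new Adler32() }) {
          long start = System.nanoTime();
          for (int i = 0; i < 10000; i++) { // ~640 MB per algorithm
            sum.reset();
            sum.update(buf, 0, buf.length);
          }
          long ms = (System.nanoTime() - start) / 1000000;
          System.out.println(sum.getClass().getSimpleName() + ": " + ms + " ms");
        }
      }
    }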
Is there any reason why CRC32 was chosen, or why Adler32 wouldn't
be an acceptable checksum? I understand that Adler32 is weak for
small messages (small as in hundreds of bytes), but since this sits
behind a buffered writer, the messages should all be thousands of
bytes to begin with. Worst case, I guess we could select the
checksum algorithm based on the size of the message, using CRC32
for small messages and Adler32 for bigger ones.
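Since both classes implement java.util.zip.Checksum, that fallback
would be nearly a one-liner (a sketch; the 512-byte cutoff is a
made-up placeholder, not a measured threshold):

    import java.util.zip.Adler32;
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    class ChecksumChooser {
      // Use CRC32 for small chunks, Adler32 for large ones.
      // 512 bytes is a placeholder, not a measured threshold.
      static Checksum forChunk(int chunkLength) {
        return chunkLength < 512 ? new CRC32() : new Adler32();
      }
    }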
-Bryan