I put together a small benchmark app for just the two checksum algorithms
(code available on request). I ran the same amount of data
through each one in exactly the same pattern. The results look like
this:
Adler32: 1983 ms
CRC32: 6514 ms
Ratio (Adler32/CRC32): 0.30442125
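For reference, the benchmark is roughly the following shape (a sketch,
not my exact code; the buffer size and iteration count here are made
up, and mine differ):

import java.util.Random;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumBench {
  public static void main(String[] args) {
    byte[] data = new byte[64 * 1024];   // made-up buffer size
    new Random(42).nextBytes(data);
    time(new Adler32(), data);           // warm-up so JIT compilation
    time(new CRC32(), data);             // doesn't skew the first timing
    System.out.println("Adler32: " + time(new Adler32(), data) + " ms");
    System.out.println("CRC32:   " + time(new CRC32(), data) + " ms");
  }

  static long time(Checksum sum, byte[] data) {
    long start = System.currentTimeMillis();
    for (int i = 0; i < 100000; i++) {   // made-up iteration count
      sum.reset();
      sum.update(data, 0, data.length);
    }
    if (sum.getValue() == -1)            // consume the value; getValue()
      throw new AssertionError();        // is never -1, this just keeps
                                         // the loop from being optimized away
    return System.currentTimeMillis() - start;
  }
}

The point is just that both implementations run over identical data
through the shared Checksum interface, so the two timings are directly
comparable.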
The ratio holds across different test lengths, too. This would seem
to indicate that there's a fair bit of benefit to be had from
switching to Adler32. From looking at the HDFS code, it even seems to
be written against the Checksum interface rather than one particular
implementation, so it doesn't look like it would be hard to swap this in.
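Concretely, since java.util.zip.Adler32 and java.util.zip.CRC32 both
implement Checksum, the swap should mostly be a matter of changing the
construction site. A toy illustration (not actual HDFS code):

import java.util.zip.Adler32;
import java.util.zip.Checksum;

public class SwapDemo {
  public static void main(String[] args) {
    // where the code currently constructs a CRC32, hand back an
    // Adler32 instead; everything downstream that is written against
    // Checksum keeps working unchanged
    Checksum sum = new Adler32();
    byte[] chunk = new byte[512];
    sum.update(chunk, 0, chunk.length);
    System.out.println(Long.toHexString(sum.getValue()));
  }
}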
I might still take the time to build an isolated benchmark that
exercises the actual Hadoop code, but I thought I'd share these
intermediate results.
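On Doug's zero-fill suggestion (quoted below): I imagine the stand-in
would be a no-op implementation of Checksum, something like this
sketch (NullChecksum is a name I made up, and wiring it through the
datanode is exactly the part I'm unsure about):

import java.util.zip.Checksum;

// hypothetical stand-in, for benchmarking only: computes nothing,
// always reports zero
public class NullChecksum implements Checksum {
  public void update(int b) {}
  public void update(byte[] b, int off, int len) {}
  public long getValue() { return 0; }
  public void reset() {}
}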
-Bryan
On Oct 7, 2008, at 10:31 AM, Doug Cutting wrote:
Don't try this on anything but an experimental filesystem. If you
can simply find the places where HDFS calls the CRC algorithm and
replace them with zeros, then you should be able to get a
reasonable benchmark.
Doug
Bryan Duxbury wrote:
I'm willing to give this a shot. Let me just be sure I understand
what I'd have to do: if I make it stop computing CRCs altogether,
I also need to change the datanode so that it stops checking the
validity of CRCs, right? Will this break anything interesting
and unexpected?
On Oct 6, 2008, at 4:58 PM, Doug Cutting wrote:
Bryan Duxbury wrote:
I am profiling with YourKit on random reducers. I'm also running
on HDFS, so I don't know how one would go about disabling CRCs.
Hack the CRC-computing code to fill things with zeros?
Doug