Isn't DataChecksum just a wrapper of CRC32? I'm still using Hadoop 0.20.2; there is no PureJavaCrc32 there.
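(On a build that does ship org.apache.hadoop.util.PureJavaCrc32, which implements java.util.zip.Checksum, a side-by-side run of the same loop could look roughly like the sketch below; the class name CrcCompare and the exact layout are only illustrative, and it needs the Hadoop core/common jar on the classpath to compile.)

import java.util.zip.CRC32;
import java.util.zip.Checksum;
import org.apache.hadoop.util.PureJavaCrc32;  // not available in 0.20.2

public class CrcCompare {
    // Checksum 64MB in 512-byte chunks and return elapsed milliseconds.
    static long bench(Checksum sum) {
        byte[] bs = new byte[512];
        for (int i = 0; i < bs.length; i++)
            bs[i] = (byte) i;
        final int totSize = 64 * 1024 * 1024;
        long start = System.nanoTime();
        for (int k = 0; k < totSize / bs.length; k++)
            sum.update(bs, 0, bs.length);
        return (System.nanoTime() - start) / 1000 / 1000;
    }

    public static void main(String[] args) {
        System.out.println("CRC32:         " + bench(new CRC32()) + " ms");
        System.out.println("PureJavaCrc32: " + bench(new PureJavaCrc32()) + " ms");
    }
}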
Da

On 1/5/11 7:44 PM, Milind Bhandarkar wrote:
> Have you tried with org.apache.hadoop.util.DataChecksum and
> org.apache.hadoop.util.PureJavaCrc32 ?
>
> - Milind
>
> On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:
>
>> I'm not sure of that. I wrote a small checksum program for testing. After
>> the chunk size gets larger than 8192 bytes, I don't see much performance
>> improvement. See the code below. I don't think 64MB can bring us any
>> benefit.
>> I did change io.bytes.per.checksum to 131072 in Hadoop, and the program
>> ran about 4 or 5 minutes faster (the total time for reducing is about 35
>> minutes).
>>
>> import java.util.zip.CRC32;
>> import java.util.zip.Checksum;
>>
>> public class Test1 {
>>     public static void main(String args[]) {
>>         Checksum sum = new CRC32();
>>         byte[] bs = new byte[512];
>>         final int tot_size = 64 * 1024 * 1024;
>>         long time = System.nanoTime();
>>         for (int k = 0; k < tot_size / bs.length; k++) {
>>             for (int i = 0; i < bs.length; i++)
>>                 bs[i] = (byte) i;
>>             sum.update(bs, 0, bs.length);
>>         }
>>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>>     }
>> }
>>
>> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>>> I agree with Jay B. Checksumming is usually the culprit for high CPU on
>>> clients and datanodes. Plus, a checksum of 4 bytes for every 512 bytes
>>> means that for a 64MB block the checksums total 512KB, i.e. 128 ext3
>>> blocks. Changing it to generate 1 ext3 checksum block per DFS block will
>>> speed up reads/writes without any loss of reliability.
>>>
>>> - milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> (mbhandar...@linkedin.com)
>>> (650-776-3236)
>
> ---
> Milind Bhandarkar
> (mbhandar...@linkedin.com)
> (650-776-3236)
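For reference, the arithmetic behind the 512KB figure quoted above: at the default io.bytes.per.checksum of 512, a 64MB block carries 64MB / 512 = 131072 four-byte CRCs, i.e. 512KB of checksum data; at 131072 it carries only 512 CRCs, i.e. 2KB. A throwaway variant of the Test1 program that prints both the timing and that overhead for a few chunk sizes could look like this (the class name ChunkSizeTest and the exact chunk values are just for illustration):

import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChunkSizeTest {
    public static void main(String[] args) {
        final int blockSize = 64 * 1024 * 1024;     // one DFS block
        int[] chunkSizes = {512, 8192, 131072};     // values discussed in this thread
        for (int chunk : chunkSizes) {
            byte[] bs = new byte[chunk];
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            Checksum sum = new CRC32();
            long start = System.nanoTime();
            for (int k = 0; k < blockSize / chunk; k++)
                sum.update(bs, 0, bs.length);
            long ms = (System.nanoTime() - start) / 1000 / 1000;
            long crcBytes = (long) (blockSize / chunk) * 4;   // one 4-byte CRC per chunk
            System.out.println("bytes.per.checksum=" + chunk
                    + "  checksum data per 64MB block=" + crcBytes + " bytes"
                    + "  time=" + ms + " ms");
        }
    }
}

If the timings flatten out past 8192, that matches the observation earlier in the thread: the win comes from amortizing per-update overhead, not from ever-larger chunks.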