Have you tried org.apache.hadoop.util.DataChecksum and org.apache.hadoop.util.PureJavaCrc32?
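
A rough, untested sketch of that comparison, reusing the loop from your test below. PureJavaCrc32 implements java.util.zip.Checksum, so it drops straight into the same loop; the class name CrcCompare and the bench() helper here are just for illustration:

    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    import org.apache.hadoop.util.PureJavaCrc32;

    public class CrcCompare {
        // Checksum totSize bytes in chunk-sized updates and return milliseconds.
        static long bench(Checksum sum, int chunk, int totSize) {
            byte[] bs = new byte[chunk];
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            long start = System.nanoTime();
            for (int k = 0; k < totSize / bs.length; k++)
                sum.update(bs, 0, bs.length);
            return (System.nanoTime() - start) / 1000 / 1000;
        }

        public static void main(String[] args) {
            final int totSize = 64 * 1024 * 1024;
            System.out.println("java.util.zip.CRC32: "
                    + bench(new CRC32(), 512, totSize) + " ms");
            System.out.println("PureJavaCrc32:       "
                    + bench(new PureJavaCrc32(), 512, totSize) + " ms");
        }
    }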
- Milind

On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:

> I'm not sure of that. I wrote a small checksum program for testing. After the
> size of a block gets larger than 8192 bytes, I don't see much performance
> improvement. See the code below. I don't think 64MB can bring us any benefit.
> I did change io.bytes.per.checksum to 131072 in Hadoop, and the program ran
> about 4 or 5 minutes faster (the total reduce time is about 35 minutes).
>
> import java.util.zip.CRC32;
> import java.util.zip.Checksum;
>
> public class Test1 {
>     public static void main(String[] args) {
>         // Checksum 64MB of data in 512-byte chunks and report the time in ms.
>         Checksum sum = new CRC32();
>         byte[] bs = new byte[512];
>         final int tot_size = 64 * 1024 * 1024;
>         long time = System.nanoTime();
>         for (int k = 0; k < tot_size / bs.length; k++) {
>             for (int i = 0; i < bs.length; i++)
>                 bs[i] = (byte) i;
>             sum.update(bs, 0, bs.length);
>         }
>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>     }
> }
>
> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>> I agree with Jay B. Checksumming is usually the culprit for high CPU on
>> clients and datanodes. Plus, a checksum of 4 bytes for every 512 bytes means
>> that for a 64MB block the checksums come to 512KB, i.e. 128 ext3 blocks.
>> Changing it to generate 1 ext3 checksum block per DFS block will speed up
>> reads and writes without any loss of reliability.
>>
>> - milind
>>
>> ---
>> Milind Bhandarkar
>> (mbhandar...@linkedin.com)
>> (650-776-3236)

---
Milind Bhandarkar
(mbhandar...@linkedin.com)
(650-776-3236)
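
P.S. A variant of the test above that sweeps the chunk size fed to update() may make the 8192-byte knee easier to see. Untested sketch; the sizes just bracket the 512-byte io.bytes.per.checksum default and the 131072 value tried above, and the buffer is filled once outside the timed loop so only the checksumming is measured:

    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public class CrcChunkSweep {
        public static void main(String[] args) {
            final int totSize = 64 * 1024 * 1024;
            int[] chunks = {512, 4096, 8192, 65536, 131072};
            for (int chunk : chunks) {
                Checksum sum = new CRC32();
                byte[] bs = new byte[chunk];
                for (int i = 0; i < bs.length; i++)
                    bs[i] = (byte) i;
                long start = System.nanoTime();
                // Time only the checksum updates over 64MB of data.
                for (int k = 0; k < totSize / bs.length; k++)
                    sum.update(bs, 0, bs.length);
                long ms = (System.nanoTime() - start) / 1000 / 1000;
                System.out.println("chunk " + chunk + " bytes: " + ms + " ms");
            }
        }
    }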