Have you tried org.apache.hadoop.util.DataChecksum and 
org.apache.hadoop.util.PureJavaCrc32?
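
Something like the snippet below would let you compare the two directly (just 
a sketch: it assumes the Hadoop jars are on your classpath, and since 
PureJavaCrc32 implements java.util.zip.Checksum it drops straight into your 
test):

import java.util.zip.CRC32;
import java.util.zip.Checksum;

import org.apache.hadoop.util.PureJavaCrc32;

public class CrcCompare {
    public static void main(String[] args) {
        // One checksum chunk of test data.
        byte[] bs = new byte[131072];
        for (int i = 0; i < bs.length; i++)
            bs[i] = (byte) i;

        // Time both CRC32 implementations over 64MB of input.
        time("java.util.zip.CRC32", new CRC32(), bs);
        time("PureJavaCrc32      ", new PureJavaCrc32(), bs);
    }

    private static void time(String name, Checksum sum, byte[] bs) {
        final int totSize = 64 * 1024 * 1024;
        long start = System.nanoTime();
        for (int k = 0; k < totSize / bs.length; k++)
            sum.update(bs, 0, bs.length);
        System.out.println(name + " takes "
                + (System.nanoTime() - start) / 1000 / 1000
                + " ms, crc=" + sum.getValue());
    }
}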

- Milind

On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:

> I'm not sure about that. I wrote a small checksum program for testing; once 
> the chunk size fed to each checksum grows beyond 8192 bytes, I don't see much 
> further performance improvement. See the code below. I don't think going all 
> the way to 64MB would bring any benefit. I did change io.bytes.per.checksum 
> to 131072 in Hadoop (the setting is sketched after the code below), and the 
> job ran about 4 or 5 minutes faster (the total reduce time is about 35 
> minutes).
> 
> import java.util.zip.CRC32;
> import java.util.zip.Checksum;
> 
> public class Test1 {
>     public static void main(String[] args) {
>         Checksum sum = new CRC32();
>         // Chunk size per update; I varied this from 512 bytes upwards.
>         byte[] bs = new byte[512];
>         final int tot_size = 64 * 1024 * 1024;  // checksum 64MB in total
>         long time = System.nanoTime();
>         for (int k = 0; k < tot_size / bs.length; k++) {
>             // Fill the buffer with test data, then checksum it.
>             for (int i = 0; i < bs.length; i++)
>                 bs[i] = (byte) i;
>             sum.update(bs, 0, bs.length);
>         }
>         // Elapsed time in milliseconds.
>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>     }
> }
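> 
> For completeness, the io.bytes.per.checksum change itself was just a config 
> edit; the programmatic equivalent would be roughly the following (a sketch of 
> a client-side override, not the exact change I made):
> 
> import org.apache.hadoop.conf.Configuration;
> 
> public class ChecksumConf {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         // Default is one CRC per 512 bytes; raise it to 128KB per CRC.
>         conf.setInt("io.bytes.per.checksum", 131072);
>         System.out.println(conf.getInt("io.bytes.per.checksum", 512));
>     }
> }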
> 
> 
> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>> I agree with Jay B. Checksumming is usually the culprit for high CPU on 
>> clients and datanodes. Plus, a 4-byte checksum for every 512 bytes means that 
>> for a 64MB block the checksums add up to 512KB, i.e. 128 ext3 blocks. Changing 
>> it to generate one ext3 block of checksums per DFS block would speed up 
>> reads and writes without any loss of reliability.
>> 
>> - milind
>> 
>> ---
>> Milind Bhandarkar
>> (mbhandar...@linkedin.com)
>> (650-776-3236)
> 
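
For reference, the arithmetic in my earlier mail works out like this (a quick 
back-of-the-envelope check in code, assuming 4KB ext3 blocks and 4-byte CRC32 
checksums):

public class ChecksumOverhead {
    public static void main(String[] args) {
        final long dfsBlock = 64L * 1024 * 1024;  // 64MB DFS block
        final int crcSize = 4;                    // CRC32 checksum size
        final int ext3Block = 4 * 1024;           // assumed ext3 block size

        // Default: one CRC per 512 bytes of data.
        long checksumBytes = dfsBlock / 512 * crcSize;
        System.out.println(checksumBytes / 1024 + "KB of checksums = "
                + checksumBytes / ext3Block + " ext3 blocks per DFS block");

        // To fit one DFS block's checksums into a single ext3 block,
        // io.bytes.per.checksum would need to be roughly:
        long bytesPerChecksum = dfsBlock / (ext3Block / crcSize);
        System.out.println("io.bytes.per.checksum ~ "
                + bytesPerChecksum / 1024 + "KB");
    }
}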

---
Milind Bhandarkar
(mbhandar...@linkedin.com)
(650-776-3236)