Isn't DataChecksum just a wrapper around CRC32?
I'm still using Hadoop 0.20.2; there is no PureJavaCrc32 in it.
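
For what it's worth, here's a rough harness sketch for comparing implementations
behind the java.util.zip.Checksum interface (ChecksumBench is just a throwaway
name; the PureJavaCrc32 line is commented out since, as I said, 0.20.2 doesn't
ship it):

import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumBench {
    // Feed 64MB through a Checksum implementation in 512-byte chunks
    // and report the elapsed time and throughput.
    static void time(String name, Checksum sum) {
        byte[] bs = new byte[512];
        for (int i = 0; i < bs.length; i++)
            bs[i] = (byte) i;
        final int tot_size = 64 * 1024 * 1024;
        long start = System.nanoTime();
        for (int k = 0; k < tot_size / bs.length; k++)
            sum.update(bs, 0, bs.length);
        double ms = (System.nanoTime() - start) / 1e6;
        System.out.println(name + ": " + ms + " ms ("
                + (64 * 1000.0 / ms) + " MB/s)");
    }

    public static void main(String[] args) {
        time("java.util.zip.CRC32", new CRC32());
        // On a build that ships it, PureJavaCrc32 also implements
        // java.util.zip.Checksum, so it would drop in here:
        // time("PureJavaCrc32", new org.apache.hadoop.util.PureJavaCrc32());
    }
}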

Da

On 1/5/11 7:44 PM, Milind Bhandarkar wrote:
> Have you tried with org.apache.hadoop.util.DataChecksum and 
> org.apache.hadoop.util.PureJavaCrc32 ?
> 
> - Milind
> 
> On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:
> 
>> I'm not sure about that. I wrote a small checksum program for testing. Once
>> the chunk size grows beyond 8192 bytes, I don't see much further performance
>> improvement (see the code below), so I don't think going all the way to 64MB
>> would buy us anything.
>> I did change io.bytes.per.checksum to 131072 in Hadoop, and the program ran
>> about 4 or 5 minutes faster (the total reduce time is about 35 minutes).
>>
>> import java.util.zip.CRC32;
>> import java.util.zip.Checksum;
>>
>> public class Test1 {
>>     public static void main(String args[]) {
>>         Checksum sum = new CRC32();
>>         byte[] bs = new byte[512];
>>         final int tot_size = 64 * 1024 * 1024;
>>         long time = System.nanoTime();
>>         // Checksum 64MB in 512-byte chunks, refilling the buffer each pass.
>>         for (int k = 0; k < tot_size / bs.length; k++) {
>>             for (int i = 0; i < bs.length; i++)
>>                 bs[i] = (byte) i;
>>             sum.update(bs, 0, bs.length);
>>         }
>>         // Elapsed time in milliseconds.
>>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>>     }
>> }
>>
>>
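>> If you want to see where the curve flattens, here's a rough variant that
>> sweeps the chunk size from 512 bytes up to 128KB (Test2 is just a throwaway
>> name):
>>
>> import java.util.zip.CRC32;
>> import java.util.zip.Checksum;
>>
>> public class Test2 {
>>     public static void main(String args[]) {
>>         final int tot_size = 64 * 1024 * 1024;
>>         for (int len = 512; len <= 128 * 1024; len *= 2) {
>>             Checksum sum = new CRC32();
>>             byte[] bs = new byte[len];
>>             for (int i = 0; i < bs.length; i++)
>>                 bs[i] = (byte) i;
>>             long time = System.nanoTime();
>>             // 64MB total, checksummed in chunks of 'len' bytes.
>>             for (int k = 0; k < tot_size / bs.length; k++)
>>                 sum.update(bs, 0, bs.length);
>>             System.out.println(len + " bytes/chunk takes "
>>                     + (System.nanoTime() - time) / 1000 / 1000 + " ms");
>>         }
>>     }
>> }
>>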
>> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>>> I agree with Jay B. Checksumming is usually the culprit for high CPU on
>>> clients and datanodes. Also, at 4 bytes of checksum for every 512 bytes of
>>> data, a 64MB block carries 512KB of checksums, i.e. 128 ext3 blocks. Changing
>>> it so that each DFS block generates only one ext3 block worth of checksums
>>> would speed up reads and writes without any loss of reliability.
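>>>
>>> Back-of-the-envelope, in code (ChecksumOverhead is just a throwaway name;
>>> 65536 is simply the bytes-per-checksum value that yields one 4KB ext3 block
>>> of checksums per 64MB DFS block):
>>>
>>> public class ChecksumOverhead {
>>>     public static void main(String[] args) {
>>>         final long blockSize = 64L * 1024 * 1024;   // one DFS block
>>>         // Checksum bytes per DFS block = 4 * (blockSize / bytesPerChecksum).
>>>         for (int bytesPerChecksum : new int[] { 512, 65536 }) {
>>>             long sumBytes = 4 * (blockSize / bytesPerChecksum);
>>>             System.out.println(bytesPerChecksum + " bytes/checksum -> "
>>>                     + sumBytes + " bytes of checksums, i.e. "
>>>                     + (sumBytes / 4096) + " ext3 block(s)");
>>>         }
>>>     }
>>> }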
>>>
>>> - milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> (mbhandar...@linkedin.com)
>>> (650-776-3236)
>>
> 
> ---
> Milind Bhandarkar
> (mbhandar...@linkedin.com)
> (650-776-3236)
