Isn't DataChecksum just a wrapper around CRC32?
I'm still using Hadoop 0.20.2, and there is no PureJavaCrc32 in it.
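
For what it's worth, below is a rough sketch of how I'd time implementations
against each other through the common java.util.zip.Checksum interface. The
PureJavaCrc32 part is only an assumption (it would have to be copied in from a
newer Hadoop release), so it's commented out:

import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Times any java.util.zip.Checksum implementation over 64MB of data,
// fed in 512-byte chunks, and reports milliseconds.
public class ChecksumBench {
    static long timeIt(Checksum sum, int chunkSize, int totalSize) {
        byte[] bs = new byte[chunkSize];
        for (int i = 0; i < bs.length; i++)
            bs[i] = (byte) i;
        long start = System.nanoTime();
        for (int k = 0; k < totalSize / chunkSize; k++)
            sum.update(bs, 0, bs.length);
        return (System.nanoTime() - start) / 1000 / 1000;
    }

    public static void main(String[] args) {
        final int totalSize = 64 * 1024 * 1024;
        System.out.println("CRC32: " + timeIt(new CRC32(), 512, totalSize) + " ms");
        // Assumption: PureJavaCrc32 copied in from a newer Hadoop release;
        // it also implements java.util.zip.Checksum, so it drops straight in.
        // System.out.println("PureJavaCrc32: "
        //     + timeIt(new PureJavaCrc32(), 512, totalSize) + " ms");
    }
}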
Da
On 1/5/11 7:44 PM, Milind Bhandarkar wrote:
> Have you tried with org.apache.hadoop.util.DataChecksum and
> org.apache.hadoop.util.PureJavaCrc32 ?
>
> - Milind
>
> On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:
>
>> I'm not sure about that. I wrote a small checksum program for testing. Once
>> the chunk fed to the checksum gets larger than 8192 bytes, I don't see much
>> further performance improvement; see the code below. I don't think going all
>> the way to 64MB would bring us any benefit.
>> I did change io.bytes.per.checksum to 131072 in Hadoop, and the program ran
>> about 4 or 5 minutes faster (the total time for reducing is about 35
>> minutes).
>>
>> import java.util.zip.CRC32;
>> import java.util.zip.Checksum;
>>
>> // Times CRC32 over 64MB of data fed in 512-byte chunks.
>> public class Test1 {
>>     public static void main(String args[]) {
>>         Checksum sum = new CRC32();
>>         byte[] bs = new byte[512];
>>         final int tot_size = 64 * 1024 * 1024;
>>         long time = System.nanoTime();
>>         for (int k = 0; k < tot_size / bs.length; k++) {
>>             for (int i = 0; i < bs.length; i++)
>>                 bs[i] = (byte) i;
>>             sum.update(bs, 0, bs.length);
>>         }
>>         // elapsed time in milliseconds
>>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>>     }
>> }
>>
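(An aside on the io.bytes.per.checksum change mentioned above: a minimal sketch
of setting it from client code rather than in the site files, assuming the
write path honors a client-side Configuration override; the path name below is
made up:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: raise the checksum chunk size for files written by this client.
// The property name and value are from the thread; whether your version
// honors a client-side override is an assumption to verify.
public class LargerChecksumChunks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.bytes.per.checksum", 131072); // default is 512
        FileSystem fs = FileSystem.get(conf);
        // hypothetical path, intended to be written with 128KB checksum chunks
        fs.create(new Path("/tmp/checksum-test")).close();
    }
}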
>>
>> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>>> I agree with Jay B. Checksumming is usually the culprit for high CPU on
>>> clients and datanodes. Plus, a checksum of 4 bytes for every 512 bytes means
>>> that for a 64MB block, the checksum data comes to 512KB, i.e. 128 ext3
>>> blocks. Changing it to generate one ext3 checksum block per DFS block will
>>> speed up reads and writes without any loss of reliability.
>>>
>>> - milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> ([email protected])
>>> (650-776-3236)
>>>
>>
>
> ---
> Milind Bhandarkar
> ([email protected])
> (650-776-3236)
>