A checksum file per block would hold many CRCs, one per 64k chunk or so of the block, so it would still permit random access. The datanode would only need to checksum the data accessed plus, on average, an extra 32k.
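Roughly what I have in mind, as a sketch only (the 64k chunk size and all names here are illustrative, not a proposed API):

import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Sketch: one CRC32 per 64k chunk of a block file, so a random read
// only needs to re-checksum the chunks it actually touches.
public class ChunkedCrcSketch {
  static final int CHUNK_SIZE = 64 * 1024;   // illustrative 64k chunk size

  // Compute one CRC per chunk over the whole block.
  static long[] checksumBlock(RandomAccessFile block) throws Exception {
    long len = block.length();
    int chunks = (int) ((len + CHUNK_SIZE - 1) / CHUNK_SIZE);
    long[] crcs = new long[chunks];
    byte[] buf = new byte[CHUNK_SIZE];
    for (int i = 0; i < chunks; i++) {
      long pos = (long) i * CHUNK_SIZE;
      int n = (int) Math.min(CHUNK_SIZE, len - pos);
      block.seek(pos);
      block.readFully(buf, 0, n);
      CRC32 crc = new CRC32();
      crc.update(buf, 0, n);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  // Verify only the chunks overlapping the requested byte range.
  static boolean verifyRange(RandomAccessFile block, long[] crcs,
                             long offset, long length) throws Exception {
    long len = block.length();
    int first = (int) (offset / CHUNK_SIZE);
    int last = (int) ((offset + length - 1) / CHUNK_SIZE);
    byte[] buf = new byte[CHUNK_SIZE];
    for (int i = first; i <= last; i++) {
      long pos = (long) i * CHUNK_SIZE;
      int n = (int) Math.min(CHUNK_SIZE, len - pos);
      block.seek(pos);
      block.readFully(buf, 0, n);
      CRC32 crc = new CRC32();
      crc.update(buf, 0, n);
      if (crc.getValue() != crcs[i]) return false;
    }
    return true;
  }
}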

Also, if datanodes were to send the checksum after the data on a read, the client could validate it and the check would be end-to-end. The same would apply for writes.
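On the client side that could look something like this, assuming a made-up wire format where each chunk is immediately followed by its CRC (again, only a sketch):

import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Sketch: the datanode streams the chunk bytes and then the CRC it computed,
// and the client recomputes and compares, making the check end-to-end.
public class ClientSideVerify {
  // Read one chunk of 'len' bytes followed by its CRC32, verifying on the client.
  static byte[] readVerifiedChunk(DataInputStream in, int len) throws IOException {
    byte[] data = new byte[len];
    in.readFully(data);                 // chunk payload
    long sent = in.readLong();          // checksum trailing the data
    CRC32 crc = new CRC32();
    crc.update(data, 0, len);
    if (crc.getValue() != sent) {
      throw new IOException("checksum mismatch on read");
    }
    return data;
  }
}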

A checksummed filesystem that embeds checksums into the data makes the data unusable by tools that don't anticipate checksums. In HDFS, data is accessible only via the HDFS client, so this is not an issue and the checksums can be stripped out before they reach clients. But for Local and S3, where data is accessible without going through Hadoop's Filesystem implementations, this is a problem.

I'd much prefer a ChecksummedFile implementation that applications can use when working with filesystems that don't do checksums. Even in this implementation it's probably better to write a side checksum file, as is done currently, rather than push checksums into the data.
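A rough sketch of what I mean by ChecksummedFile, writing CRCs to a side ".crc" file while leaving the data file untouched (the names and chunk size are illustrative):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.zip.CRC32;

// Sketch: data goes to the file as-is, CRCs go to a side ".crc" file, so
// tools unaware of checksums can still read the data directly.
public class ChecksummedFileWriter implements AutoCloseable {
  static final int CHUNK_SIZE = 64 * 1024;   // illustrative chunk size

  private final FileOutputStream data;
  private final DataOutputStream sums;
  private final byte[] buf = new byte[CHUNK_SIZE];
  private int buffered = 0;

  ChecksummedFileWriter(String path) throws Exception {
    data = new FileOutputStream(path);
    sums = new DataOutputStream(new FileOutputStream(path + ".crc"));
  }

  void write(byte[] b, int off, int len) throws Exception {
    while (len > 0) {
      int n = Math.min(len, CHUNK_SIZE - buffered);
      System.arraycopy(b, off, buf, buffered, n);
      buffered += n; off += n; len -= n;
      if (buffered == CHUNK_SIZE) flushChunk();
    }
  }

  private void flushChunk() throws Exception {
    CRC32 crc = new CRC32();
    crc.update(buf, 0, buffered);
    data.write(buf, 0, buffered);
    sums.writeLong(crc.getValue());      // one CRC per chunk in the side file
    buffered = 0;
  }

  public void close() throws Exception {
    if (buffered > 0) flushChunk();
    data.close();
    sums.close();
  }
}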


Doug Cutting wrote:
Hairong Kuang wrote:
Another option is to create a checksum file per block at the data node where
the block is placed.

Yes, but then we'd need a separate checksum implementation for intermediate data, and for other distributed filesystems that don't already guarantee end-to-end data integrity. Also, a checksum per block would not permit checksums on randomly accessed data without re-checksumming the entire block. Finally, the checksum wouldn't be end-to-end. We really want to checksum data as close to its source as possible, then validate that checksum as close to its use as possible.

Doug
