A checksum file per block would hold many CRCs, one per 64k chunk or so of the block, so it would still permit random access. The datanode would only need to checksum the data accessed plus, on average, an extra 32k.
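Roughly what I have in mind, as a sketch only (the 64k chunk size and all names here are illustrative, not a proposed API):

import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Sketch: one CRC32 per 64k chunk of a block file, so a random read
// only needs to re-checksum the chunks it actually touches.
public class ChunkedCrcSketch {
  static final int CHUNK_SIZE = 64 * 1024;   // illustrative 64k chunk size

  // Compute one CRC per chunk over the whole block.
  static long[] checksumBlock(RandomAccessFile block) throws Exception {
    long len = block.length();
    int chunks = (int) ((len + CHUNK_SIZE - 1) / CHUNK_SIZE);
    long[] crcs = new long[chunks];
    byte[] buf = new byte[CHUNK_SIZE];
    for (int i = 0; i < chunks; i++) {
      long pos = (long) i * CHUNK_SIZE;
      int n = (int) Math.min(CHUNK_SIZE, len - pos);
      block.seek(pos);
      block.readFully(buf, 0, n);
      CRC32 crc = new CRC32();
      crc.update(buf, 0, n);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  // Verify only the chunks overlapping the requested byte range.
  static boolean verifyRange(RandomAccessFile block, long[] crcs,
                             long offset, long length) throws Exception {
    long len = block.length();
    int first = (int) (offset / CHUNK_SIZE);
    int last = (int) ((offset + length - 1) / CHUNK_SIZE);
    byte[] buf = new byte[CHUNK_SIZE];
    for (int i = first; i <= last; i++) {
      long pos = (long) i * CHUNK_SIZE;
      int n = (int) Math.min(CHUNK_SIZE, len - pos);
      block.seek(pos);
      block.readFully(buf, 0, n);
      CRC32 crc = new CRC32();
      crc.update(buf, 0, n);
      if (crc.getValue() != crcs[i]) return false;
    }
    return true;
  }
}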

Also, if datanodes were to send the checksum after the data on a read, the client could validate it and the check would be end-to-end. The same would apply for writes.
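On the client side that could look something like this, assuming a made-up wire format where each chunk is immediately followed by its CRC (again, only a sketch):

import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Sketch: the datanode streams the chunk bytes and then the CRC it computed,
// and the client recomputes and compares, making the check end-to-end.
public class ClientSideVerify {
  // Read one chunk of 'len' bytes followed by its CRC32, verifying on the client.
  static byte[] readVerifiedChunk(DataInputStream in, int len) throws IOException {
    byte[] data = new byte[len];
    in.readFully(data);                 // chunk payload
    long sent = in.readLong();          // checksum trailing the data
    CRC32 crc = new CRC32();
    crc.update(data, 0, len);
    if (crc.getValue() != sent) {
      throw new IOException("checksum mismatch on read");
    }
    return data;
  }
}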

A checksummed filesystem that embeds checksums into the data makes the data unusable by tools that don't anticipate checksums. In HDFS, data is accessible only via the HDFS client, so this is not an issue and the checksums can be stripped out before they reach clients. But for Local and S3, where data is accessible without going through Hadoop's Filesystem implementations, this is a problem.

I'd much prefer a ChecksummedFile implementation that applications can use when working with filesystems that don't do checksums. Even in this implementation it's probably better to write a side checksum file, as is done currently, rather than push checksums into the data.
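A rough sketch of what I mean by ChecksummedFile, writing CRCs to a side ".crc" file while leaving the data file untouched (the names and chunk size are illustrative):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.zip.CRC32;

// Sketch: data goes to the file as-is, CRCs go to a side ".crc" file, so
// tools unaware of checksums can still read the data directly.
public class ChecksummedFileWriter implements AutoCloseable {
  static final int CHUNK_SIZE = 64 * 1024;   // illustrative chunk size

  private final FileOutputStream data;
  private final DataOutputStream sums;
  private final byte[] buf = new byte[CHUNK_SIZE];
  private int buffered = 0;

  ChecksummedFileWriter(String path) throws Exception {
    data = new FileOutputStream(path);
    sums = new DataOutputStream(new FileOutputStream(path + ".crc"));
  }

  void write(byte[] b, int off, int len) throws Exception {
    while (len > 0) {
      int n = Math.min(len, CHUNK_SIZE - buffered);
      System.arraycopy(b, off, buf, buffered, n);
      buffered += n; off += n; len -= n;
      if (buffered == CHUNK_SIZE) flushChunk();
    }
  }

  private void flushChunk() throws Exception {
    CRC32 crc = new CRC32();
    crc.update(buf, 0, buffered);
    data.write(buf, 0, buffered);
    sums.writeLong(crc.getValue());      // one CRC per chunk in the side file
    buffered = 0;
  }

  public void close() throws Exception {
    if (buffered > 0) flushChunk();
    data.close();
    sums.close();
  }
}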


Doug Cutting wrote:
Hairong Kuang wrote:
Another option is to create a checksum file per block at the data node where
the block is placed.

Yes, but then we'd need a separate checksum implementation for intermediate data, and for other distributed filesystems that don't already guarantee end-to-end data integrity. Also, a checksum per block would not permit checksums on randomly accessed data without re-checksumming the entire block. Finally, the checksum wouldn't be end-to-end. We really want to checksum data as close to its source as possible, then validate that checksum as close to its use as possible.

Doug
