A checksum file per block would have many CRCs, one per 64k chunk or so
in the block. So it would still permit random access. The datanode would
only checksum the data accessed plus on average an extra 32k.
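As a rough illustration of that arithmetic (a sketch only; the 64k constant and the helper names are illustrative, not an existing Hadoop API):

  public final class ChunkRange {
      static final int CHUNK_SIZE = 64 * 1024;   // one CRC per 64k of block data

      /** Index of the first checksum chunk overlapping a read of [offset, offset + length). */
      static long firstChunk(long offset) {
          return offset / CHUNK_SIZE;
      }

      /** Index of the last checksum chunk (inclusive) overlapping the range. */
      static long lastChunk(long offset, long length) {
          return (offset + length - 1) / CHUNK_SIZE;
      }

      /** Extra bytes checksummed beyond the requested range (ignoring a short final chunk). */
      static long extraBytes(long offset, long length) {
          long chunks = lastChunk(offset, length) - firstChunk(offset) + 1;
          return chunks * CHUNK_SIZE - length;
      }
  }

Only the chunks between firstChunk and lastChunk need their CRCs recomputed, so the cost stays proportional to the data read rather than to the block size.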
Also, if datanodes were to send the checksum after the data on a read,
the client could validate the checksum, making the check end-to-end. The
same would apply to writes.
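A minimal sketch of what client-side validation could look like, assuming a hypothetical wire format in which each chunk of data is followed by its CRC32 (none of these class or method names are Hadoop APIs):

  import java.io.DataInputStream;
  import java.io.IOException;
  import java.util.zip.CRC32;

  class ChecksummedChunkReader {
      /** Reads one chunk plus its trailing CRC and verifies it at the client. */
      static byte[] readChunk(DataInputStream in, int chunkLen) throws IOException {
          byte[] data = new byte[chunkLen];
          in.readFully(data);
          long expected = in.readInt() & 0xFFFFFFFFL;   // CRC sent after the data
          CRC32 crc = new CRC32();
          crc.update(data);
          if (crc.getValue() != expected) {
              throw new IOException("Checksum error: data corrupted on disk or in transit");
          }
          return data;
      }
  }

Because the comparison happens at the client, corruption introduced anywhere between the datanode's disk and the client is caught.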
A checksummed filesystem that embeds checksums into the data makes that
data unusable by tools that don't anticipate the checksums. In HDFS,
data is accessible only via the HDFS client, so this is not an issue:
the checksums can be stripped out before the data reaches clients. But
for Local and S3, where data is accessible without going through
Hadoop's FileSystem implementations, this is a problem.
I'd much prefer a ChecksummedFile implementation that applications can
use when working with filesystems that don't do checksums. Even in this
implementation it's probably better to write a side checksum file, as is
done currently, rather than pushing checksums into the data.
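A minimal sketch of that idea, assuming a writer that accumulates one CRC32 per 64k chunk and appends it to a side file; the class name, side-file naming, and chunk size are illustrative, not an existing Hadoop API:

  import java.io.DataOutputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.zip.CRC32;

  class ChecksummedFileWriter implements AutoCloseable {
      private static final int CHUNK_SIZE = 64 * 1024;

      private final FileOutputStream data;
      private final DataOutputStream sums;   // side checksum file, kept out of the data
      private final CRC32 crc = new CRC32();
      private int bytesInChunk = 0;

      ChecksummedFileWriter(String path) throws IOException {
          data = new FileOutputStream(path);
          sums = new DataOutputStream(new FileOutputStream(path + ".crc"));
      }

      void write(byte[] buf, int off, int len) throws IOException {
          while (len > 0) {
              int n = Math.min(len, CHUNK_SIZE - bytesInChunk);
              data.write(buf, off, n);
              crc.update(buf, off, n);
              bytesInChunk += n;
              off += n;
              len -= n;
              if (bytesInChunk == CHUNK_SIZE) {
                  flushChunk();
              }
          }
      }

      private void flushChunk() throws IOException {
          sums.writeInt((int) crc.getValue());
          crc.reset();
          bytesInChunk = 0;
      }

      @Override
      public void close() throws IOException {
          if (bytesInChunk > 0) {
              flushChunk();   // checksum the trailing partial chunk
          }
          data.close();
          sums.close();
      }
  }

The data file stays byte-for-byte identical to what the application wrote, so tools that know nothing about checksums can still read it directly.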
Doug Cutting wrote:
Hairong Kuang wrote:
Another option is to create a checksum file per block at the data node
where the block is placed.
Yes, but then we'd need a separate checksum implementation for
intermediate data, and for other distributed filesystems that don't
already guarantee end-to-end data integrity. Also, a checksum per block
would not permit checksums on randomly accessed data without
re-checksumming the entire block. Finally, the checksum wouldn't be
end-to-end. We really want to checksum data as close to its source as
possible, then validate that checksum as close to its use as possible.
Doug