A checksummed filesystem that embeds checksums in the data makes the data
unreadable by tools that don't expect checksums. In HDFS, data is
accessible only via the HDFS client, so this is not an issue: the
checksums can be stripped out before they reach clients. But for Local
and S3, where data is accessible without going through Hadoop's FileSystem
implementations, this is a problem.

For S3, it strikes me that we could put a checksum in the metadata for
the block - this would be ignored by tools that aren't aware of it
(provided the data itself is also not block-based - see
http://www.mail-archive.com/[email protected]/msg00695.html).
Blocks are written to temporary files on disk before being sent to S3,
so it would be straightforward to checksum them before calling S3.
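
As a rough sketch of what the write path might look like (this is not the
actual Hadoop code - it uses the AWS SDK for Java purely for illustration,
and the bucket name, key, file path and the "block-crc32" metadata key are
all made up), we'd checksum the temporary block file and attach the result
as user metadata on the object:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class ChecksummedBlockUpload {

  // Hypothetical user-metadata key for the block checksum.
  private static final String CHECKSUM_KEY = "block-crc32";

  // Compute a CRC32 checksum over the local temporary block file.
  static long crc32Of(File blockFile) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[8192];
    try (InputStream in = new FileInputStream(blockFile)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        crc.update(buf, 0, n);
      }
    }
    return crc.getValue();
  }

  // Upload the block, storing the checksum as user metadata so that
  // tools unaware of it still see the object data unchanged.
  static void putBlock(AmazonS3 s3, String bucket, String key, File blockFile)
      throws IOException {
    ObjectMetadata meta = new ObjectMetadata();
    meta.addUserMetadata(CHECKSUM_KEY, Long.toString(crc32Of(blockFile)));
    PutObjectRequest request = new PutObjectRequest(bucket, key, blockFile);
    request.setMetadata(meta);
    s3.putObject(request);
  }

  public static void main(String[] args) throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    putBlock(s3, "my-bucket", "blocks/blk_1234", new File("/tmp/blk_1234"));
  }
}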

S3 actually provides MD5 hashes of objects, but this isn't guaranteed
to be supported in the future
(http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51645),
so we should use our own checksum metadata.
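
The read side could then verify against that metadata rather than the MD5.
Again just a sketch, with the same made-up metadata key and paths as above:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class BlockChecksumVerifier {

  // Same hypothetical metadata key as in the upload sketch above.
  private static final String CHECKSUM_KEY = "block-crc32";

  public static void main(String[] args) throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket";
    String key = "blocks/blk_1234";
    String localCopy = "/tmp/blk_1234.downloaded";

    // Recompute the CRC32 over the bytes read back from S3.
    CRC32 crc = new CRC32();
    byte[] buf = new byte[8192];
    try (InputStream in = new FileInputStream(localCopy)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        crc.update(buf, 0, n);
      }
    }

    // Compare against the checksum stored in the object's user metadata.
    // S3 lower-cases metadata keys, and the SDK strips the x-amz-meta- prefix.
    String stored = s3.getObjectMetadata(bucket, key).getUserMetadata().get(CHECKSUM_KEY);
    if (stored == null) {
      System.out.println("No checksum metadata: written by a checksum-unaware tool");
    } else if (Long.parseLong(stored) != crc.getValue()) {
      throw new IOException("Checksum mismatch for " + key);
    } else {
      System.out.println("Checksum OK for " + key);
    }
  }
}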

Tom
