A checksummed filesystem that embeds checksums in the data makes the data unreadable by tools that don't expect the checksums. In HDFS the data is accessible only via the HDFS client, so this is not an issue: the checksums can be stripped out before they reach clients. But for Local and S3, where the data is accessible without going through Hadoop's FileSystem implementations, this is a problem.
For S3, it strikes me that we could put a checksum in the metadata for the block - this would be ignored by tools that aren't aware of it (if the data is also not block-based - see http://www.mail-archive.com/[email protected]/msg00695.html). Blocks are written to temporary files on disk before being sent to S3, so it would be straightforward to checksum them before calling S3. S3 actually provides MD5 hashes of objects, but this isn't guaranteed to be supported in the future (http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51645), so we should use our own checksum metadata.
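
Roughly, something like the sketch below - purely for illustration, it assumes the AWS SDK for Java and a CRC32 checksum, and the metadata key name ("block-crc32") is made up; the real code would plug into whatever checksum and S3 client classes the filesystem already uses.

// Sketch only: checksum a temporary block file on local disk before
// uploading it to S3, and record the checksum as user-defined object
// metadata. The data itself is left untouched, so tools that don't know
// about the checksum can still read the object.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class ChecksummedBlockUpload {

  // Hypothetical user-metadata key; S3 stores it as x-amz-meta-block-crc32.
  private static final String CHECKSUM_KEY = "block-crc32";

  // Compute a CRC32 over the temporary block file before it is sent to S3.
  static long checksum(File blockFile) throws IOException {
    CRC32 crc = new CRC32();
    try (InputStream in =
        new CheckedInputStream(new FileInputStream(blockFile), crc)) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // reading through CheckedInputStream updates the CRC
      }
    }
    return crc.getValue();
  }

  // Upload the block and attach the checksum as user metadata.
  static void putBlock(AmazonS3 s3, String bucket, String key, File blockFile)
      throws IOException {
    ObjectMetadata meta = new ObjectMetadata();
    meta.addUserMetadata(CHECKSUM_KEY, Long.toString(checksum(blockFile)));
    PutObjectRequest req = new PutObjectRequest(bucket, key, blockFile);
    req.setMetadata(meta);
    s3.putObject(req);
  }

  public static void main(String[] args) throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    putBlock(s3, args[0], args[1], new File(args[2]));
  }
}

Tom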
