Another option is to create a checksum file per block at the data node where
the block is placed. This approach cleanly separates data from checksums and
does not require many changes to open(), seek(), and length(). On create,
when a block is written to a data node, the data node creates a checksum
file at the same time.

We could keep the same checksum file naming convention as the current one: a
checksum file is named after its block file, with a "." prefix and a ".crc"
suffix.

During an upgrade, the name node removes all checksum files from its
namespace, and each data node creates a checksum file per block if one does
not already exist.
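
For illustration only (the class name, file layout, and the 512-byte chunk
size below are my assumptions, not existing datanode code), the datanode
side might compute one CRC32 per fixed-size chunk of the block and write the
values to the sibling ".<block>.crc" file:

import java.io.*;
import java.util.zip.CRC32;

/** Sketch: write a ".<block>.crc" checksum file next to a block file. */
public class BlockChecksumWriter {
  private static final int BYTES_PER_CHECKSUM = 512;  // assumed chunk size

  public static void writeChecksumFile(File blockFile) throws IOException {
    File crcFile = new File(blockFile.getParentFile(),
                            "." + blockFile.getName() + ".crc");
    try (InputStream in = new BufferedInputStream(new FileInputStream(blockFile));
         DataOutputStream out = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(crcFile)))) {
      byte[] chunk = new byte[BYTES_PER_CHECKSUM];
      CRC32 crc = new CRC32();
      int n;
      while ((n = readChunk(in, chunk)) > 0) {
        crc.reset();
        crc.update(chunk, 0, n);
        out.writeInt((int) crc.getValue());  // one CRC32 per chunk
      }
    }
  }

  /** Fill the buffer if possible; return the number of bytes read (0 at EOF). */
  private static int readChunk(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    return off;
  }
}

In practice the data node would checksum the bytes as it receives them rather
than re-reading the block, but the on-disk result would be the same.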

Hairong

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, January 23, 2007 3:51 PM
To: [email protected]
Subject: inline checksums

The current checksum implementation writes CRC32 values to a parallel file.
Unfortunately these parallel files pollute the namespace.  In particular,
this places a heavier burden on the HDFS namenode.

Perhaps we should consider placing checksums inline in file data.  For
example, we might write the data as a sequence of fixed-size
<checksum><payload> entries.  This could be implemented as a FileSystem
wrapper, ChecksummedFileSystem.  The create() method would return a stream
that uses a small buffer that checksums data as it arrives, then writes the
checksums in front of the data as the buffer is flushed.  The
open() method could similarly check each buffer as it is read.  The
seek() and length() methods would adjust for the interpolated checksums.
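
As a rough sketch of the create() side (the class and constant names below
are illustrative, assuming CRC32 and a fixed 512-byte payload per entry):

import java.io.*;
import java.util.zip.CRC32;

/** Sketch: writes data as a sequence of <4-byte CRC32><payload> entries. */
public class ChecksummedOutputStream extends OutputStream {
  private static final int PAYLOAD_SIZE = 512;   // assumed payload size per entry
  private final DataOutputStream out;
  private final byte[] buf = new byte[PAYLOAD_SIZE];
  private final CRC32 crc = new CRC32();
  private int count = 0;

  public ChecksummedOutputStream(OutputStream raw) {
    this.out = new DataOutputStream(raw);
  }

  @Override public void write(int b) throws IOException {
    buf[count++] = (byte) b;
    if (count == PAYLOAD_SIZE) flushChunk();
  }

  /** Emit the checksum, then the buffered payload. */
  private void flushChunk() throws IOException {
    if (count == 0) return;
    crc.reset();
    crc.update(buf, 0, count);
    out.writeInt((int) crc.getValue());
    out.write(buf, 0, count);
    count = 0;
  }

  @Override public void close() throws IOException {
    flushChunk();        // the last entry may be shorter than PAYLOAD_SIZE
    out.close();
  }
}

With this layout, seek() would map a logical offset p to an underlying offset
of (p / 512) * 516 + 4 + (p % 512), and length() would subtract four bytes
per entry from the underlying length.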

Checksummed files could have their names suffixed internally with something
like ".hcs0".  Checksum processing would be skipped for files without this
suffix, for back-compatibility and interoperability. 
Directory listings would be modified to remove this suffix.
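
A rough sketch of that name mapping (again, names here are only
illustrative):

/** Sketch of the internal-suffix handling for checksummed files. */
public class ChecksumNames {
  static final String SUFFIX = ".hcs0";

  /** Underlying name used when creating a new checksummed file. */
  static String toUnderlying(String userPath) {
    return userPath + SUFFIX;
  }

  /** True if the stored file should be read with checksum verification. */
  static boolean isChecksummed(String storedName) {
    return storedName.endsWith(SUFFIX);
  }

  /** Name shown in directory listings: hide the internal suffix. */
  static String toUserVisible(String storedName) {
    return isChecksummed(storedName)
        ? storedName.substring(0, storedName.length() - SUFFIX.length())
        : storedName;
  }
}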

Existing checksum code in FileSystem.java could be removed, including all
'raw' methods.

HDFS would use ChecksummedFileSystem.  If block names were modified to
encode the checksum version, then datanodes could validate checksums. 
(We could ensure that checksum boundaries are aligned with block
boundaries.)
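
One way to state the alignment condition (sizes here are my assumptions): if
blocks are cut in the underlying byte stream, the block size should be a
whole multiple of the entry size, so that each block holds complete
<checksum><payload> entries and a datanode can validate its block in
isolation.

/** Sketch: reject block sizes that would split a checksum entry across blocks. */
public class ChecksumAlignment {
  public static void checkAlignment(long blockSize, int payloadSize, int checksumSize) {
    long entrySize = payloadSize + checksumSize;   // e.g. 512 + 4 = 516 bytes
    if (blockSize % entrySize != 0) {
      throw new IllegalArgumentException("block size " + blockSize
          + " is not a multiple of the checksum entry size " + entrySize);
    }
  }
}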

We could have two versions of the local filesystem: one with checksums and
one without.  The DFS shell could use the checksumless version for exporting
files, while MapReduce could use the checksummed version for intermediate
data.

S3 might or might not use this, depending on whether we think Amazon already
provides sufficient data integrity.

Thoughts?

Doug

