[ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628743#action_12628743 ]

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> We use these 6.25MB second level CRCs as the checksum of the entire file.

Why not just use the MD5 or SHA1 of the CRCs?
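
For illustration only, here is a minimal sketch of what "MD5 of the CRCs" 
could look like, assuming the per-block CRC values are already available as 
byte arrays; the class and method names (CrcDigest, md5OfCrcs) are 
hypothetical, not an existing Hadoop API:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.List;

  public class CrcDigest {
    /**
     * Hypothetical helper: digest the already-computed block CRCs
     * instead of re-reading the file data itself.
     */
    public static byte[] md5OfCrcs(List<byte[]> blockCrcs)
        throws NoSuchAlgorithmException {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (byte[] crcBytes : blockCrcs) {
        md5.update(crcBytes);   // feed each block's CRC bytes in block order
      }
      return md5.digest();      // 16-byte file-level checksum
    }
  }

The point of the sketch is simply that a standard digest over the existing 
CRC stream gives a fixed-size file checksum without reading the file data.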

When should we compute checksums?  Are they computed on demand, when someone 
calls FileSystem#getFileChecksum()?  Or are they pre-computed and stored?  If 
they're not pre-computed, then we certainly ought to compute them from the 
CRCs.  Even if they are to be pre-computed, we might still derive them from 
the CRCs, to reduce FileSystem upgrade time.

If checksums were pre-computed, where would they be stored?  We could store 
them in the NameNode, with file metadata, or we could store per-block checksums 
on datanodes.

My hunch is that we should compute them on demand from CRC data.  We extend 
ClientDatanodeProtocol to add a getChecksum() operation that returns the 
checksum for a block without transmitting the CRCs to the client, and the 
client combines block checksums to get a whole-file checksum.  This is rather 
expensive, but still a lot faster than checksumming the entire file on demand.  
DistCp would be substantially faster if it only used checksums when file 
lengths match, so we should probably make that optimization.
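
A rough sketch of the client side of that scheme, under the stated 
assumptions: each datanode digests its local CRC data and returns only a 
per-block checksum, and the client folds the block checksums together in 
block order.  The types and names here (BlockChecksumSource, 
getBlockChecksum, FileChecksumClient) are simplified stand-ins, not the real 
protocol classes:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.List;

  /** Illustrative stand-in for the proposed datanode-side operation. */
  interface BlockChecksumSource {
    /** Returns the checksum of one block, computed from the datanode's
     *  local CRC data; the CRCs themselves are never shipped to the client. */
    byte[] getBlockChecksum(long blockId);
  }

  public class FileChecksumClient {
    /** Combine per-block checksums into a single whole-file checksum. */
    public static byte[] getFileChecksum(BlockChecksumSource datanodes,
                                         List<Long> blockIds)
        throws NoSuchAlgorithmException {
      MessageDigest digest = MessageDigest.getInstance("MD5");
      for (long blockId : blockIds) {
        digest.update(datanodes.getBlockChecksum(blockId)); // one small RPC per block
      }
      return digest.digest(); // comparable across clusters with the same block layout
    }
  }

The cost is one round trip per block plus a CRC scan on each datanode, which 
is the "rather expensive, but still a lot faster" trade-off described above.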

Longer-term we could think about a checksum API that permits a sequence of 
checksums to be returned per file, so that, e.g., if a source file has been 
appended to, we could truncate the destination and append the new data, 
incrementally updating it.  But until HDFS supports truncation this is moot.

> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading 
> the entire input message sequentially in a central location.  HDFS supports 
> large files of multiple terabytes.  The overhead of reading an entire file of 
> that size is huge.  A distributed file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
