[ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628817#action_12628817 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

bq. When should we compute checksums? Are they computed on demand, when someone 
calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If 
they're not pre-computed then we certainly ought to compute them from the 
CRCs. Even if they are to be pre-computed, we might still use the CRCs, 
to reduce FileSystem upgrade time.

It is better to compute the file checksum on demand, so that the Datanode 
storage layout remains unchanged and we won't have to do a distributed upgrade.

bq. My hunch is that we should compute them on demand from CRC data. We extend 
ClientDatanodeProtocol to add a getChecksum() operation that returns the 
checksum for a block without transmitting the CRCs to the client, and the 
client combines block checksums to get a whole-file checksum. This is rather 
expensive, but still a lot faster than checksumming the entire file on demand.

My idea is similar to this, except that we should not compute block checksums.  
Otherwise the computed file checksum would depend on the block size.  That is 
why I propose computing the second-level CRCs over the first-level CRCs.  This 
idea is borrowed from hash trees (aka Merkle trees), which are used by ZFS.
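
For illustration, here is a minimal sketch of the second-level computation 
(not HDFS code; the class and method names are hypothetical).  It assumes the 
first-level per-chunk CRC32 values have already been collected from the 
datanodes in file order:

{code:java}
import java.util.zip.CRC32;

public class SecondLevelCrcSketch {

  /**
   * Computes a second-level CRC32 over the ordered stream of first-level
   * (per-chunk) CRC32 values of a file.  Because the chunks have a fixed
   * size (io.bytes.per.checksum) independent of the block size, the result
   * is the same no matter how the file is split into blocks.
   */
  public static long fileChecksum(Iterable<int[]> chunkCrcsPerBlock) {
    CRC32 secondLevel = new CRC32();
    for (int[] blockCrcs : chunkCrcsPerBlock) {   // blocks in file order
      for (int crc : blockCrcs) {                 // chunk CRCs in order
        secondLevel.update(new byte[] {           // 4 bytes, big-endian
            (byte) (crc >>> 24), (byte) (crc >>> 16),
            (byte) (crc >>> 8),  (byte) crc });
      }
    }
    return secondLevel.getValue();
  }
}
{code}

Contrast this with combining per-block checksums: there the block boundaries 
enter the computation, so two copies of the same file written with different 
block sizes would yield different file checksums.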


> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading 
> the entire input message sequentially in a central location.  HDFS supports 
> large files of multiple terabytes.  The overhead of reading the entire 
> file is huge. A distributed file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
