[ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629217#action_12629217 ]

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> Otherwise the file checksum computed depends on the block size.

It still depends on bytes.per.checksum, which can vary per file, just like 
block size.  If two files have different bytes.per.checksum then we should not 
compare CRC-derived checksums.  Perhaps we can use bytes.per.checksum in the 
algorithm name, e.g., MD5-of-CRC32-every-512bytes could be an algorithm name.  
If we compute these per-block, then the algorithm name would be 
MD5-of-CRC32-every-512bytes-with-64Mblocks.
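
For concreteness, a tiny sketch of how such a name might be derived.  The
helper below is purely illustrative, not an existing Hadoop API:

  // Illustrative only: derive an algorithm name from a file's
  // bytes.per.checksum and, optionally, its block size.
  public class ChecksumAlgorithmName {

    // e.g. (512, 64 MB) -> "MD5-of-CRC32-every-512bytes-with-64Mblocks"
    public static String name(int bytesPerChecksum, long blockSize) {
      return "MD5-of-CRC32-every-" + bytesPerChecksum + "bytes"
          + "-with-" + (blockSize / (1024L * 1024L)) + "Mblocks";
    }

    // e.g. 512 -> "MD5-of-CRC32-every-512bytes", when CRCs are combined
    // across block boundaries and block size no longer matters
    public static String name(int bytesPerChecksum) {
      return "MD5-of-CRC32-every-" + bytesPerChecksum + "bytes";
    }

    public static void main(String[] args) {
      System.out.println(name(512, 64L * 1024 * 1024));
      System.out.println(name(512));
    }
  }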

If we compute checksums on demand from CRCs then it will be relatively slow.
Distcp thus needs to be sure to request checksums only when lengths match and
when the alternative would be copying the entire file.  So long as distcp is
the primary client of checksums this is probably sufficient, and we should
not bother storing checksums.
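
In other words, something like the following.  This is only a sketch; the
RemoteFile type and its methods stand in for whatever the real FileSystem
API ends up being:

  // Sketch of the distcp decision: check lengths first (cheap), and only
  // ask for the relatively slow, on-demand checksums when lengths match.
  import java.io.IOException;
  import java.util.Arrays;

  public class CopyDecision {

    // Stand-in for a (FileSystem, Path) pair; not a real Hadoop interface.
    interface RemoteFile {
      long getLength() throws IOException;
      byte[] getChecksum() throws IOException;  // computed on demand from CRCs
    }

    // Returns true if the destination should be (re)copied.
    static boolean needsCopy(RemoteFile src, RemoteFile dst) throws IOException {
      if (src.getLength() != dst.getLength()) {
        return true;                            // no checksum RPCs issued at all
      }
      return !Arrays.equals(src.getChecksum(), dst.getChecksum());
    }
  }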

Another API to consider might be:
  - String[] getChecksumAlgorithms(Path)
  - Checksum getChecksum(Path)

This way an HDFS filesystem might return
["MD5-of-CRC32-every-512bytes-with-64Mblocks", "MD5-of-CRC32-every-512bytes",
"MD5"] as the possible algorithms for a file, in preferred order.  Then Distcp
could call this for two files (whose lengths match) to see if they have any
compatible algorithms.  If possible, CRCs would be combined on datanodes, but
if block sizes differ, the CRCs could be summed in the client.  If the CRCs
are incompatible, then MD5s could be computed on datanodes.  Is this overkill?
Probably.
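
For concreteness, a rough Java rendering of the API above and of how distcp
might use it.  Everything here is illustrative: the interface name, the
Checksum value type, and especially the two-argument getChecksum, which is
not in the list above but which distcp would presumably need in order to ask
for the agreed algorithm:

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.List;
  import org.apache.hadoop.fs.Path;

  // Illustrative rendering of the proposed API; names and shapes are guesses.
  interface ChecksummedFileSystem {

    // Algorithms available for the file, in preferred order.
    String[] getChecksumAlgorithms(Path file) throws IOException;

    // Checksum under the filesystem's preferred algorithm.
    Checksum getChecksum(Path file) throws IOException;

    // Assumed extension: checksum under a specific, mutually agreed algorithm.
    Checksum getChecksum(Path file, String algorithm) throws IOException;

    // Opaque, algorithm-tagged checksum value.
    interface Checksum {
      String getAlgorithmName();
      byte[] getBytes();
    }
  }

  // How distcp might negotiate: take the first source-preferred algorithm the
  // destination also offers, then compare checksums under that algorithm.
  class CompatibleChecksums {

    static boolean sameContents(ChecksummedFileSystem srcFs, Path src,
                                ChecksummedFileSystem dstFs, Path dst)
        throws IOException {
      List<String> dstAlgorithms = Arrays.asList(dstFs.getChecksumAlgorithms(dst));
      for (String algorithm : srcFs.getChecksumAlgorithms(src)) {
        if (dstAlgorithms.contains(algorithm)) {
          return Arrays.equals(srcFs.getChecksum(src, algorithm).getBytes(),
                               dstFs.getChecksum(dst, algorithm).getBytes());
        }
      }
      return false;  // no compatible algorithm: fall back to copying the file
    }
  }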


> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading 
> the entire input message sequentially in a central location.  HDFS supports 
> large files of multiple terabytes.  The overhead of reading the entire 
> file is huge. A distributed file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
