[ https://issues.apache.org/jira/browse/HDFS-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606039#comment-13606039 ]

Tsz Wo (Nicholas), SZE commented on HDFS-4605:
----------------------------------------------

> Most hashing is incremental, so if DFSClient feeds the last state of hash 
> into the next datanode and let it continue updating it, the result will be 
> independent of block size. ...

However, the above computation is sequential, so I think it would take a 
very long time for large files.

How about we assume that the block size is a multiple of a small number, 
say 1 MB?  Then, each datanode computes its 1-MB checksums (over the CRC32s) 
in parallel.  Finally, the DFSClient combines all the 1-MB checksums.
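The idea above can be sketched in simplified, single-process form. This is not the actual HDFS implementation: the method name, the 1-MB chunk size, and the MD5-over-CRC32 combination step are illustrative assumptions. The point it demonstrates is that, as long as the block size is a multiple of the chunk size, the chunk boundaries (and therefore the combined checksum) do not depend on the block size chosen:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class ChunkedChecksumSketch {
    // Hypothetical fixed chunk size, per the 1-MB assumption in the comment.
    static final int CHUNK = 1 << 20;

    // Compute a CRC32 per fixed-size chunk within each block, then combine
    // all chunk CRCs with MD5.  Each datanode could compute its own block's
    // chunk CRCs in parallel; here we iterate sequentially for clarity.
    static byte[] fileChecksumBlocked(byte[] data, int blockSize) throws Exception {
        if (blockSize % CHUNK != 0) {
            throw new IllegalArgumentException("blockSize must be a multiple of CHUNK");
        }
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (int blk = 0; blk < data.length; blk += blockSize) {
            int blkLen = Math.min(blockSize, data.length - blk);
            for (int off = 0; off < blkLen; off += CHUNK) {
                int len = Math.min(CHUNK, blkLen - off);
                CRC32 crc = new CRC32();
                crc.update(data, blk + off, len);
                // Feed each chunk's CRC into the combining digest.
                md5.update(ByteBuffer.allocate(8).putLong(crc.getValue()).array());
            }
        }
        return md5.digest();
    }
}
```

Because the chunks are aligned to the same fixed grid regardless of the block size, `fileChecksumBlocked(data, 2 * CHUNK)` and `fileChecksumBlocked(data, 4 * CHUNK)` yield the same digest for the same data.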
                
> Implement block-size independent file checksum
> ----------------------------------------------
>
>                 Key: HDFS-4605
>                 URL: https://issues.apache.org/jira/browse/HDFS-4605
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Kihwal Lee
>
> The value of the current getFileChecksum() is block-size dependent. Since 
> FileChecksum is mainly intended for comparing the content of files, removing 
> this dependency will make FileChecksum in HDFS relevant in more use cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
