[ https://issues.apache.org/jira/browse/HDFS-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606039#comment-13606039 ]
Tsz Wo (Nicholas), SZE commented on HDFS-4605:
----------------------------------------------

> Most hashing is incremental, so if DFSClient feeds the last state of the hash
> into the next datanode and lets it continue updating it, the result will be
> independent of block size. ...

However, the above computation is sequential, so I think it would take a very long time for large files. How about we assume that the block size is a multiple of a small number, say 1 MB? Then each datanode computes its 1-MB checksums (over the CRC32s) in parallel, and finally DFSClient combines all of the 1-MB checksums.

> Implement block-size independent file checksum
> ----------------------------------------------
>
>                 Key: HDFS-4605
>                 URL: https://issues.apache.org/jira/browse/HDFS-4605
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Kihwal Lee
>
> The value of the current getFileChecksum() is block-size dependent. Since FileChecksum is mainly intended for comparing the content of files, removing this dependency will make FileChecksum in HDFS relevant in more use cases.
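The scheme proposed in the comment can be sketched roughly as follows. This is a simplified illustration, not HDFS code: real datanodes keep CRC32s per 512-byte chunk and the 1-MB checksums would be computed over those stored CRCs, whereas here each 1-MB chunk is CRC'd directly, and the class and method names (`ChunkedChecksum`, `chunkCrcs`, `combine`) are hypothetical. The key property shown is that as long as every block size is a multiple of 1 MB, the combined digest does not depend on where the block boundaries fall.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkedChecksum {
    static final int CHUNK = 1 << 20; // 1 MB, the assumed divisor of the block size

    // Datanode side (parallelizable): one CRC32 per 1-MB chunk of a block.
    static long[] chunkCrcs(byte[] block) {
        int n = (block.length + CHUNK - 1) / CHUNK;
        long[] crcs = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int len = Math.min(CHUNK, block.length - i * CHUNK);
            crc.update(block, i * CHUNK, len);
            crcs[i] = crc.getValue();
        }
        return crcs;
    }

    // Client side: combine the per-block chunk checksums into one file digest.
    static byte[] combine(List<long[]> perBlockCrcs) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (long[] crcs : perBlockCrcs)
            for (long c : crcs)
                md5.update(ByteBuffer.allocate(8).putLong(c).array());
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[4 << 20];
        new java.util.Random(42).nextBytes(data);
        // Same file content stored as 2 MB + 2 MB blocks vs. 1 MB + 3 MB blocks.
        byte[] d1 = combine(List.of(
                chunkCrcs(Arrays.copyOfRange(data, 0, 2 << 20)),
                chunkCrcs(Arrays.copyOfRange(data, 2 << 20, 4 << 20))));
        byte[] d2 = combine(List.of(
                chunkCrcs(Arrays.copyOfRange(data, 0, 1 << 20)),
                chunkCrcs(Arrays.copyOfRange(data, 1 << 20, 4 << 20))));
        System.out.println(Arrays.equals(d1, d2)); // true: digest is block-size independent
    }
}
```

Because the 1-MB chunk boundaries align across any block layout whose sizes are 1-MB multiples, the concatenated sequence of chunk CRCs, and hence the final MD5, is identical for both layouts; each datanode's `chunkCrcs` step can run concurrently, unlike the sequential feed-the-hash-state approach.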