[ https://issues.apache.org/jira/browse/HDFS-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603509#comment-13603509 ]

Kihwal Lee commented on HDFS-4605:
----------------------------------

From MAPREDUCE-5065,

bq. Most hashing is incremental, so if DFSClient feeds the last state of the hash 
into the next datanode and lets it continue updating it, the result will be 
independent of block size. The current way of computing the file checksum allows 
individual block checksums to be calculated in parallel, but we are not taking 
advantage of that in DFSClient anyway, so I don't think there will be any 
significant change in performance or overhead.

I think this will work as long as 
* there are no partial blocks in the middle, and
* the block size is a multiple of the CRC chunk size.
As far as I know, both of these are enforced in HDFS.
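To make the idea concrete, here is a minimal sketch of a block-size-independent
checksum: CRC32 over fixed-size chunks (assuming the 512-byte default for
dfs.bytes-per-checksum), fed into a single MD5 digest that is carried across
block boundaries instead of being restarted per block. The class and method
names are invented for illustration, and handing the MD5 digest state from one
datanode to the next is exactly the piece that would need new protocol support;
this single-process sketch glosses over that part.

{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

/**
 * Hypothetical sketch, not the current HDFS implementation: one MD5 digest is
 * updated with every chunk CRC across all blocks, so the result depends only
 * on the file bytes and the CRC chunk size, not on the block size.
 */
public class BlockSizeIndependentChecksum {

    private static final int BYTES_PER_CRC = 512; // assumed chunk size

    public static byte[] checksum(byte[][] blocks) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        CRC32 crc = new CRC32();

        for (byte[] block : blocks) {
            // Block lengths are assumed to be multiples of BYTES_PER_CRC,
            // except possibly the last block of the file.
            for (int off = 0; off < block.length; off += BYTES_PER_CRC) {
                int len = Math.min(BYTES_PER_CRC, block.length - off);
                crc.reset();
                crc.update(block, off, len);
                long c = crc.getValue();
                // Feed the 4-byte chunk CRC into the running MD5. In the
                // proposed scheme this digest state would be handed from one
                // datanode to the next rather than restarted per block.
                md5.update(new byte[] {
                    (byte) (c >>> 24), (byte) (c >>> 16),
                    (byte) (c >>> 8),  (byte) c });
            }
        }
        // Identical no matter how the same bytes were split into blocks.
        return md5.digest();
    }
}
{code}

Because the outer digest only ever sees the ordered stream of chunk CRCs,
splitting the same data into 64 MB or 128 MB blocks yields the same result,
which is the block-size independence being asked for here.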

Assuming this can be done, what would be the best way to add this feature? 
                
> Implement block-size independent file checksum
> ----------------------------------------------
>
>                 Key: HDFS-4605
>                 URL: https://issues.apache.org/jira/browse/HDFS-4605
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Kihwal Lee
>
> The value of the current getFileChecksum() is block-size dependent. Since 
> FileChecksum is mainly intended for comparing the contents of files, removing 
> this dependency will make FileChecksum in HDFS relevant in more use cases.
