Colin Marc created HDFS-7554:
--------------------------------

             Summary: Checksumming is implementation specific
                 Key: HDFS-7554
                 URL: https://issues.apache.org/jira/browse/HDFS-7554
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient
    Affects Versions: 2.5.0
            Reporter: Colin Marc
            Priority: Minor


The code that calculates file checksums in DFSClient is implementation 
specific. That is to say, the checksums are consistent as long as you use the 
same code, but the algorithm isn't particularly stable or portable.

In DFSClient.java, when each individual checksum is received for a block, those 
checksums are written out to a DataOutputBuffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173

Then the checksum is calculated by digesting all the data from that buffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231

However, that buffer's backing array is (reasonably) automatically padded with 
zeroes out to the next power of two, and because the digest is computed over 
the whole backing array, those padding zeroes are included in the checksum.
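A minimal standalone sketch of the effect (this simulates the power-of-two 
growth with a plain byte array rather than using Hadoop's actual 
DataOutputBuffer class): digesting the full backing array gives a different 
hash than digesting only the bytes that were actually written.

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class PaddingDemo {
    public static void main(String[] args) throws Exception {
        // 5 bytes standing in for the concatenated per-block checksums.
        byte[] checksums = new byte[]{1, 2, 3, 4, 5};

        // Simulate a growable buffer whose backing array doubles in size:
        // 5 written bytes land in an 8-byte array, zero-padded at the end.
        byte[] backing = Arrays.copyOf(checksums, 8);

        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] withPadding = md.digest(backing);   // digests all 8 bytes
        byte[] exactBytes = md.digest(checksums);  // digests only the 5 real bytes

        // The two digests differ: the padding leaked into the hash.
        System.out.println(!Arrays.equals(withPadding, exactBytes));
    }
}
```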

This effectively means that the checksum algorithm depends on the internal 
behavior of DataOutputBuffer, which is a bit surprising, and could change in 
the future. It would be much more stable, not to mention more memory 
efficient, if the final hash were simply updated with each block checksum as 
it arrives, rather than buffering them all and then digesting that buffer.
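The suggested incremental approach could look something like the following 
sketch (the fileChecksum helper and its byte[][] input are hypothetical, not 
the actual DFSClient API): each block checksum is folded into the digest as it 
is received, so no intermediate buffer, and no buffer padding, is involved.

```java
import java.security.MessageDigest;

public class IncrementalDigest {
    // Hypothetical helper: update the file digest with each block checksum
    // as it arrives, instead of buffering them all and digesting at the end.
    static byte[] fileChecksum(byte[][] blockChecksums) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (byte[] blockMd5 : blockChecksums) {
            md.update(blockMd5); // only the real checksum bytes, no padding
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[][] blocks = { {1, 2, 3}, {4, 5} };
        byte[] digest = fileChecksum(blocks);
        System.out.println(digest.length); // MD5 digest is 16 bytes
    }
}
```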



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)