Colin Marc created HDFS-7554:
--------------------------------

             Summary: Checksumming is implementation specific
                 Key: HDFS-7554
                 URL: https://issues.apache.org/jira/browse/HDFS-7554
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient
    Affects Versions: 2.5.0
            Reporter: Colin Marc
            Priority: Minor
The code that calculates checksums of files in DFSClient is implementation specific. That is to say, the checksums are consistent as long as you use the same code, but the algorithm isn't particularly stable or portable.

In DFSClient.java, when each individual checksum is received for a block, those checksums are written out to a DataOutputBuffer:

https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173

Then the file checksum is calculated by digesting all the data from that buffer:

https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231

However, that buffer is (reasonably) automatically padded with zeroes to the next power of two, and those zeroes are included in the checksum. This effectively means that the checksum algorithm depends on the behavior of DataOutputBuffer, which is a bit surprising and could change in the future.

It would be much more stable, not to mention more memory efficient, if the final hash were simply updated with each block checksum as it arrives, rather than buffering all of them and then digesting the buffer.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
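A minimal standalone sketch of the difference being described (all names here are hypothetical, not the actual DFSClient code): `paddedToNextPowerOfTwo` simulates DataOutputBuffer's power-of-two backing-array growth, and the two MD5 digests show that hashing the padded buffer gives a different result than updating the digest with each block checksum directly.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ChecksumPaddingDemo {
    // Simulates DataOutputBuffer semantics: the backing array grows to the
    // next power of two, so valid data is followed by zero padding.
    static byte[] paddedToNextPowerOfTwo(byte[] data) {
        int cap = 1;
        while (cap < data.length) cap <<= 1;
        return Arrays.copyOf(data, cap); // trailing bytes are zero
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Stand-in for the per-block checksums received from datanodes.
        byte[] blockChecksums = new byte[] {1, 2, 3, 4, 5};

        // Buffered approach (sketch of the current behavior): digest the
        // whole backing array, zero padding included.
        MessageDigest buffered = MessageDigest.getInstance("MD5");
        byte[] digestWithPadding =
            buffered.digest(paddedToNextPowerOfTwo(blockChecksums));

        // Proposed approach: update the digest with each checksum byte as
        // it arrives; no intermediate buffer, no padding.
        MessageDigest incremental = MessageDigest.getInstance("MD5");
        incremental.update(blockChecksums);
        byte[] digestIncremental = incremental.digest();

        // The two digests differ because the padding zeroes were hashed too.
        System.out.println(Arrays.equals(digestWithPadding, digestIncremental));
    }
}
```

Running this prints `false`: the padded buffer (8 bytes here) and the raw 5 checksum bytes hash to different values, which is exactly why the final checksum is coupled to DataOutputBuffer's growth policy.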