[jira] [Updated] (HDFS-7554) Checksumming is implementation specific

Colin Marc (JIRA) Fri, 19 Dec 2014 09:00:48 -0800

     [ 
https://issues.apache.org/jira/browse/HDFS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Colin Marc updated HDFS-7554:
-----------------------------
    Description: 
The code that calculates checksums of files in DFSClient is implementation 
specific. That is to say, the checksums should remain constant as long as you 
use the same code, but the algorithm isn't particularly stable or portable.

In DFSClient.java, when each individual checksum is received for a block, those 
checksums are written out to a DataOutputBuffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173

Then the checksum is calculated by digesting all the data from that buffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231

However, that buffer is (reasonably) automatically padded with zeroes to the 
next power of two, and those zeroes are included in the checksum.
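
To illustrate the padding effect, here is a minimal, self-contained sketch. 
GrowableBuffer is a hypothetical stand-in, not Hadoop's DataOutputBuffer; its 
32-byte initial capacity, its doubling growth policy, and the use of MD5 are 
all assumptions made for the example. It only shows that digesting the whole 
backing array gives a different result than digesting the bytes actually 
written:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class PaddedDigestDemo {

    /** Hypothetical growable buffer for illustration; NOT Hadoop's DataOutputBuffer. */
    public static final class GrowableBuffer {
        private byte[] data = new byte[32]; // assumed initial capacity
        private int length = 0;             // number of valid bytes written

        public void write(byte[] src) {
            int needed = length + src.length;
            if (needed > data.length) {
                // Assumed growth policy: double the capacity until the data fits.
                int capacity = data.length;
                while (capacity < needed) {
                    capacity *= 2;
                }
                // copyOf zero-fills the new tail, which is the "padding".
                data = Arrays.copyOf(data, capacity);
            }
            System.arraycopy(src, 0, data, length, src.length);
            length = needed;
        }

        /** Returns the whole backing array, including any zero padding. */
        public byte[] getData() { return data; }

        /** Returns the count of valid bytes only. */
        public int getLength() { return length; }
    }

    /** MD5 over bytes[offset, offset + len). MD5 is chosen only for illustration. */
    public static byte[] md5(byte[] bytes, int offset, int len) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(bytes, offset, len);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        GrowableBuffer buf = new GrowableBuffer();
        byte[] blockChecksum = new byte[16]; // made-up 16-byte per-block checksum
        Arrays.fill(blockChecksum, (byte) 0x5A);
        for (int i = 0; i < 3; i++) {
            buf.write(blockChecksum);
        }

        // Digest of the full backing array vs. digest of the valid bytes only.
        byte[] padded = md5(buf.getData(), 0, buf.getData().length);
        byte[] exact = md5(buf.getData(), 0, buf.getLength());

        System.out.println("valid bytes: " + buf.getLength()
                + ", backing array: " + buf.getData().length);
        System.out.println("digests equal: " + Arrays.equals(padded, exact));
    }
}
```

With three 16-byte writes the sketch holds 48 valid bytes in a 64-byte array, 
so the two digests differ. A different growth policy would change the padded 
digest but not the exact one.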

This effectively means that the checksum algorithm depends on the behavior of 
DataOutputBuffer, which is a bit surprising and could change in the future. It 
would be much more stable, not to mention more memory-efficient, if the final 
hash were simply updated with each block checksum, rather than buffering them 
all and then digesting the buffer.
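
A rough sketch of the incremental approach, using only java.security and not 
the actual DFSClient code: the class name, the method names, the made-up 
checksum values, and the choice of MD5 are assumptions for illustration. Each 
block checksum is fed to the digest as it arrives, so the result depends only 
on the checksum bytes and never on the capacity of an intermediate buffer:

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class StreamingDigestDemo {

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5"); // MD5 chosen only for illustration
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Updates the digest with each block checksum in turn; nothing is buffered. */
    public static byte[] digestIncrementally(List<byte[]> blockChecksums) {
        MessageDigest md = newMd5();
        for (byte[] checksum : blockChecksums) {
            md.update(checksum);
        }
        return md.digest();
    }

    /** Reference result: digest of the exact concatenation, with no padding. */
    public static byte[] digestConcatenated(List<byte[]> blockChecksums) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] checksum : blockChecksums) {
            out.write(checksum, 0, checksum.length);
        }
        // toByteArray() returns a copy sized to the bytes written.
        return newMd5().digest(out.toByteArray());
    }

    public static void main(String[] args) {
        // Made-up per-block checksums, standing in for values received per block.
        List<byte[]> checksums = Arrays.asList(
                new byte[] {1, 2, 3, 4},
                new byte[] {5, 6, 7, 8},
                new byte[] {9, 10, 11, 12});

        byte[] streamed = digestIncrementally(checksums);
        byte[] reference = digestConcatenated(checksums);
        System.out.println("digests equal: " + Arrays.equals(streamed, reference));
    }
}
```

The incremental digest matches the digest of the exact concatenation, and it 
needs constant memory regardless of how many blocks the file has.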

  was:
The code that calculates checksums of files in DFSClient is implementation 
specific. That is to say, the checksums should be consistent constant as long 
as you use the same code, but the algorithm isn't particularly stable or 
portable.

In DFSClient.java, when each individual checksum is received for a block, those 
checksums are written out to a DataOutputBuffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173

Then the checksum is calculated by digesting all the data from that buffer: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231

However, that buffer is (reasonably) automatically padded with zeroes to the 
next power of two, those zeroes are included in the checksum.

This effectively means that the checksum algorithm is dependent on the behavior 
of DataOutputBuffer, which is a bit surprising, and could change in the future. 
It would be much more stable, not to mention memory efficient, if the final 
hash was simply updated with each block checksum, rather than buffering them 
all and then digesting that.


> Checksumming is implementation specific
> ---------------------------------------
>
>                 Key: HDFS-7554
>                 URL: https://issues.apache.org/jira/browse/HDFS-7554
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsclient
>    Affects Versions: 2.5.0
>            Reporter: Colin Marc
>            Priority: Minor
>
> The code that calculates checksums of files in DFSClient is implementation 
> specific. That is to say, the checksums should remain constant as long as you 
> use the same code, but the algorithm isn't particularly stable or portable.
> In DFSClient.java, when each individual checksum is received for a block, 
> those checksums are written out to a DataOutputBuffer: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173
> Then the checksum is calculated by digesting all the data from that buffer: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231
> However, that buffer is (reasonably) automatically padded with zeroes to the 
> next power of two, and those zeroes are included in the checksum.
> This effectively means that the checksum algorithm depends on the behavior of 
> DataOutputBuffer, which is a bit surprising and could change in the future. 
> It would be much more stable, not to mention more memory-efficient, if the 
> final hash were simply updated with each block checksum, rather than 
> buffering them all and then digesting the buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
