[
https://issues.apache.org/jira/browse/HDFS-7554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin Marc updated HDFS-7554:
-----------------------------
Description:
The code that calculates file checksums in DFSClient is implementation-specific.
That is to say, the checksums should remain constant as long as you use the
same code, but the algorithm isn't particularly stable or portable.
In DFSClient.java, when each individual checksum is received for a block, those
checksums are written out to a DataOutputBuffer:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173
Then the checksum is calculated by digesting all the data from that buffer:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231
However, that buffer is (reasonably) automatically padded with zeroes to the
next power of two, and those zeroes are included in the checksum.
This effectively means that the checksum algorithm is dependent on the behavior
of DataOutputBuffer, which is a bit surprising, and could change in the future.
It would be much more stable, not to mention more memory-efficient, if the
final hash were simply updated with each block checksum, rather than buffering
them all and then digesting the result.
was:
The code that calculates checksums of files in DFSClient is implementation
specific. That is to say, the checksums should be consistent constant as long
as you use the same code, but the algorithm isn't particularly stable or
portable.
In DFSClient.java, when each individual checksum is received for a block, those
checksums are written out to a DataOutputBuffer:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173
Then the checksum is calculated by digesting all the data from that buffer:
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231
However, that buffer is (reasonably) automatically padded with zeroes to the
next power of two, those zeroes are included in the checksum.
This effectively means that the checksum algorithm is dependent on the behavior
of DataOutputBuffer, which is a bit surprising, and could change in the future.
It would be much more stable, not to mention memory efficient, if the final
hash was simply updated with each block checksum, rather than buffering them
all and then digesting that.
> Checksumming is implementation specific
> ---------------------------------------
>
> Key: HDFS-7554
> URL: https://issues.apache.org/jira/browse/HDFS-7554
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient
> Affects Versions: 2.5.0
> Reporter: Colin Marc
> Priority: Minor
>
> The code that calculates file checksums in DFSClient is implementation-specific.
> That is to say, the checksums should remain constant as long as you use the
> same code, but the algorithm isn't particularly stable or portable.
> In DFSClient.java, when each individual checksum is received for a block,
> those checksums are written out to a DataOutputBuffer:
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173
> Then the checksum is calculated by digesting all the data from that buffer:
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231
> However, that buffer is (reasonably) automatically padded with zeroes to the
> next power of two, and those zeroes are included in the checksum.
> This effectively means that the checksum algorithm is dependent on the
> behavior of DataOutputBuffer, which is a bit surprising, and could change in
> the future. It would be much more stable, not to mention more memory-efficient,
> if the final hash were simply updated with each block checksum, rather than
> buffering them all and then digesting the result.
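
To make the difference concrete, here is a minimal JDK-only sketch. It is not the DFSClient code: ExposedBuffer is a hypothetical stand-in for DataOutputBuffer, its fixed 64-byte capacity is an assumption chosen to imitate power-of-two sizing, and the three 16-byte block checksums are made up. It compares digesting the whole backing array (padding included) with updating the digest once per block checksum.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ChecksumPaddingDemo {

    /** Hypothetical stand-in for a buffer that exposes its backing array. */
    static final class ExposedBuffer extends ByteArrayOutputStream {
        ExposedBuffer(int capacity) { super(capacity); }
        byte[] backingArray() { return buf; }  // may be longer than the data written
        int validLength() { return count; }    // number of bytes actually written
    }

    static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Made-up data: three 16-byte "block checksums" (48 bytes in total). */
    static byte[][] sampleChecksums() {
        byte[][] sums = new byte[3][16];
        for (int i = 0; i < sums.length; i++) {
            Arrays.fill(sums[i], (byte) (i + 1));
        }
        return sums;
    }

    /** Buffers everything, then digests the whole backing array, zero padding included. */
    static byte[] digestWholeBuffer(byte[][] blockChecksums) {
        ExposedBuffer out = new ExposedBuffer(64); // assumed power-of-two capacity
        for (byte[] c : blockChecksums) {
            out.write(c, 0, c.length);
        }
        return md5().digest(out.backingArray());
    }

    /** Buffers everything, but digests only the bytes that were actually written. */
    static byte[] digestValidBytes(byte[][] blockChecksums) {
        ExposedBuffer out = new ExposedBuffer(64);
        for (byte[] c : blockChecksums) {
            out.write(c, 0, c.length);
        }
        MessageDigest md = md5();
        md.update(out.backingArray(), 0, out.validLength());
        return md.digest();
    }

    /** Feeds each block checksum straight into the digest; nothing is buffered. */
    static byte[] digestIncrementally(byte[][] blockChecksums) {
        MessageDigest md = md5();
        for (byte[] c : blockChecksums) {
            md.update(c);
        }
        return md.digest();
    }

    public static void main(String[] args) {
        byte[][] sums = sampleChecksums();
        System.out.println("incremental == valid bytes only: "
            + Arrays.equals(digestIncrementally(sums), digestValidBytes(sums)));
        System.out.println("incremental == whole buffer:     "
            + Arrays.equals(digestIncrementally(sums), digestWholeBuffer(sums)));
    }
}
```

In this sketch the incremental digest depends only on the checksum bytes themselves, while the whole-buffer digest also covers the 16 unused zero bytes and so changes whenever the buffer's capacity does. Note that switching an existing client to the incremental form would change the checksums it reports, so this is an illustration of the proposal rather than a drop-in patch.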