[
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080604#comment-15080604
]
Kai Zheng commented on HDFS-8430:
---------------------------------
Thanks [~szetszwo] for the ideas! It looks like it is acceptable for the file
checksum of striped files to be neither compatible nor comparable with that of
replicated files. That does not sound bad, since we may seldom compare striped
files with replicated files, and if we do, the checksums will surely differ
because the block layouts are entirely different. So in this direction, I guess
things could be simpler, since we can consider different algorithms for striped
files as you said, and we could avoid the increased network traffic.
bq. Or simply compute cell checksums for replicated files instead of block
checksums.
I guess you mean *striped files*? Following this {{simple}} approach, do you
think it would work if we do it as illustrated in detail below?
Assume a block group of blocks {{b0}} to {{b5}} with {{n+1}} stripes (rows).
The 1st stripe consists of cells {{c00}} to {{c05}}, and so on. For the 1st
column, the cells {{c00}}, {{c10}}, ..., {{cn0}} reside on block {{b0}}, and
so on for the other columns.
{noformat}
b0 b1 b2 b3 b4 b5
c00 c01 c02 c03 c04 c05
c10 c11 c12 c13 c14 c15
...
cn0 cn1 cn2 cn3 cn4 cn5
{noformat}
Similar to the block MD5 algorithm for replicated files in
{{DataXceiver#blockChecksum}}, we could compute the checksum result for block
{{b0}} by aggregating the MD5 (or other algorithm) hash results of the cells
located on it ({{c00}}, {{c10}}, ..., {{cn0}}).
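To make the idea concrete, here is a minimal sketch of the per-block aggregation, not the actual HDFS code. It assumes the cells of one block are already available as byte arrays in stripe order, that each cell is hashed over its raw bytes, and that "aggregating" means hashing the concatenated cell digests. The class name {{CellChecksumSketch}} and its methods are hypothetical and only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this class does not exist in HDFS.
final class CellChecksumSketch {

    // Computes a block-level checksum for one block (e.g. b0) from the cells
    // located on it (c00, c10, ..., cn0), given in stripe order.
    // Assumption: aggregation = MD5 over the concatenated per-cell MD5 digests.
    static byte[] blockChecksum(List<byte[]> cellsInStripeOrder) {
        try {
            MessageDigest cellDigest = MessageDigest.getInstance("MD5");
            MessageDigest blockDigest = MessageDigest.getInstance("MD5");
            for (byte[] cell : cellsInStripeOrder) {
                // digest(byte[]) hashes the cell and resets cellDigest for reuse.
                blockDigest.update(cellDigest.digest(cell));
            }
            return blockDigest.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Renders a digest as lowercase hex for display.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder cell contents standing in for c00, c10, c20 on block b0.
        List<byte[]> b0Cells = Arrays.asList(
            "c00".getBytes(StandardCharsets.UTF_8),
            "c10".getBytes(StandardCharsets.UTF_8),
            "c20".getBytes(StandardCharsets.UTF_8));
        System.out.println("b0 checksum: " + toHex(blockChecksum(b0Cells)));
    }
}
```

With this kind of aggregation the result depends on the cell order and the cell size, so the same data written with a different striping layout would produce a different block checksum.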
> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
> Key: HDFS-8430
> URL: https://issues.apache.org/jira/browse/HDFS-8430
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7285
> Reporter: Walter Su
> Assignee: Kai Zheng
> Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so that it can work for
> striped block groups.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)