[ 
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080604#comment-15080604
 ] 

Kai Zheng commented on HDFS-8430:
---------------------------------

Thanks [~szetszwo] for the ideas! It sounds acceptable, then, that file 
checksums for striped files are not compatible or comparable with those of 
replicated files. That seems fine, since we would seldom compare striped files 
with replicated files, and if we did, the checksums would differ anyway 
because the block layouts are entirely different. So in this direction, I 
guess things could be simpler, since we can consider different algorithms for 
striped files, as you said, and avoid the increased network traffic. 
bq. Or simply compute cell checksums for replicated files instead of block 
checksums.
I guess you mean *striped files*? Following this {{simple}} approach, do you 
think it would work if we did it as detailed below?

Assume a block group with blocks {{b0}} to {{b5}} and {{n+1}} stripes (rows). 
The 1st stripe consists of cells {{c00}} to {{c05}}, and so on. For the 1st 
column, the cells {{c00}}, {{c10}}, ..., {{cn0}} reside on block {{b0}}, and 
likewise for the other columns.
{noformat}
b0   b1   b2   b3   b4   b5
c00  c01  c02  c03  c04  c05
c10  c11  c12  c13  c14  c15
...
cn0  cn1  cn2  cn3  cn4  cn5
{noformat}
Similar to the block MD5 algorithm for replicated files in 
{{DataXceiver#blockChecksum}}, we could compute the checksum for block 
{{b0}} by aggregating the MD5 (or other algorithm) hashes of the cells 
located on it ({{c00}}, {{c10}}, ..., {{cn0}}).
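
To make the idea concrete, here is a minimal sketch of the per-block 
aggregation, written in Python only for brevity. It is not taken from the 
patch. The function names, the toy cell contents, and the choice to aggregate 
by taking an MD5 over the concatenated cell digests are all assumptions made 
for illustration; the real cell size and aggregation rule are still open.

```python
import hashlib


def cell_md5(cell: bytes) -> bytes:
    """MD5 digest of a single cell's bytes (hypothetical helper)."""
    return hashlib.md5(cell).digest()


def block_checksum(block_group, column: int) -> bytes:
    """Checksum for one block (one column of the block group).

    Assumption: "aggregating" means taking an MD5 over the concatenated
    per-cell MD5 digests, walking the column from the first stripe to the last.
    """
    outer = hashlib.md5()
    for stripe in block_group:  # stripe = one row: [c_r0, c_r1, ..., c_r5]
        outer.update(cell_md5(stripe[column]))
    return outer.digest()


# Toy block group: 3 stripes x 6 blocks, each cell holding its own label.
block_group = [
    [f"c{row}{col}".encode() for col in range(6)]
    for row in range(3)
]

b0 = block_checksum(block_group, 0)  # aggregates c00, c10, c20
```

Since each column is hashed only from cells stored on the same block, a 
scheme like this could presumably run on the datanode holding that block, 
without pulling cell data across the network.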

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a  distributed file checksum algorithm. It's designed 
> for replicated block.
> {{DFSClient.getFileChecksum()}} need some updates, so it can work for striped 
> block group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
