[ 
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156495#comment-15156495
 ] 

Kai Zheng commented on HDFS-8430:
---------------------------------

Thanks [~rakeshr] for looking into this. Happy to clarify.
bq. Does the proposal use pre-computed block checksum present in the block 
metadata instead of re-calculating it again?
Yes. When a block in the block group is healthy, its pre-computed block 
checksum data is always used. Only when a block is missing or corrupted does 
the needed block checksum data have to be re-computed.
bq. Could you tell me the finalized/agreed approach for the checksum 
computation. I could see two approaches for a striped file ...
Sure. As discussed and confirmed by Nicholas above, we will make the existing 
DFSClient#getFileChecksum() work for striped files as well, and meanwhile add a 
new getFileChecksum(algorithm) API to align striped files with replicated 
files. The former will be done in HDFS-9694 and uses what you call approach 2: 
the file checksum is aggregated from block group checksums, and each block 
group checksum is aggregated from the block checksums one by one, from block0 
through block8 in the group. The latter will be striping-aware, along the lines 
of your approach 1. Both will compute the block group checksum on the datanode 
side; when a block is missing or corrupted, its block checksum is recomputed on 
the fly, as mentioned in HDFS-9833.
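To illustrate the aggregation order described above, here is a minimal sketch. 
It is not the HDFS-9694 patch: the class name, the method names and the 
recompute callback are hypothetical, and MD5 is assumed as the digest at each 
level. A null entry stands for a missing or corrupted block whose checksum has 
to be recomputed.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch, not real HDFS code.
public class BlockGroupChecksumSketch {

    static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** MD5 over the given bytes; stands in for a stored block checksum. */
    static byte[] md5(byte[] data) {
        return newMd5().digest(data);
    }

    /**
     * Aggregates block checksums in index order (block0 .. blockN-1).
     * A null entry marks a missing/corrupted block; its checksum is
     * obtained from the (assumed) recompute callback instead.
     */
    static byte[] blockGroupChecksum(byte[][] storedBlockChecksums,
                                     IntFunction<byte[]> recompute) {
        MessageDigest md = newMd5();
        for (int i = 0; i < storedBlockChecksums.length; i++) {
            byte[] c = storedBlockChecksums[i] != null
                    ? storedBlockChecksums[i]   // pre-computed, used as is
                    : recompute.apply(i);       // recomputed on the fly
            md.update(c);
        }
        return md.digest();
    }

    /** Aggregates block group checksums, in order, into a file checksum. */
    static byte[] fileChecksum(List<byte[]> blockGroupChecksums) {
        MessageDigest md = newMd5();
        for (byte[] g : blockGroupChecksums) {
            md.update(g);
        }
        return md.digest();
    }

    public static void main(String[] args) {
        byte[][] blocks = new byte[9][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = md5(new byte[] {(byte) i});
        }
        byte[] healthy = blockGroupChecksum(blocks, i -> null);
        StringBuilder hex = new StringBuilder();
        for (byte b : fileChecksum(List.of(healthy))) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}
```

As long as the recomputed block checksum equals the one that was stored, the 
group checksum is the same whether or not a block had to be reconstructed.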
bq. How do the new algorithm applys to a contiguous file? Does it split the 
file into smaller cells and each cell size could be of 64KB size?
Right exactly, to align with the target striped file.
bq. I hope you are planning to use the current MD5MD5CRC32, isn't it?
Yeah, for now we will still use that algorithm.
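For readers less familiar with MD5MD5CRC32, the rough shape is an MD5 over 
per-block MD5s, each of which is an MD5 over per-chunk CRC32 values. The sketch 
below only shows that layering. The class and method names are hypothetical, 
the 512-byte chunk size is an assumption, and details of the real 
implementation (CRC variant, serialization, where the work runs) are left out.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical sketch of the MD5-of-MD5-of-CRC32 layering, not HDFS code.
public class Md5Md5Crc32Sketch {
    // Assumed chunk size covered by one CRC.
    static final int BYTES_PER_CRC = 512;

    static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** MD5 over the CRC32 values of each chunk in data[off, off+len). */
    static byte[] blockChecksum(byte[] data, int off, int len) {
        MessageDigest md5 = newMd5();
        int end = off + len;
        for (int pos = off; pos < end; pos += BYTES_PER_CRC) {
            int n = Math.min(BYTES_PER_CRC, end - pos);
            CRC32 crc = new CRC32();
            crc.update(data, pos, n);
            // Feed the 4-byte big-endian CRC into the block-level MD5.
            md5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
        return md5.digest();
    }

    /** MD5 over the block checksums of consecutive blockSize-byte blocks. */
    static byte[] fileChecksum(byte[] data, int blockSize) {
        MessageDigest md5 = newMd5();
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            md5.update(blockChecksum(data, off, len));
        }
        return md5.digest();
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : fileChecksum(data, 1024)) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}
```

Because the result depends on where the block boundaries fall, the same bytes 
laid out differently give a different checksum, which is why the layouts need 
to be aligned before striped and contiguous files can be compared.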

> Erasure coding: compute file checksum for stripe files
> ------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a  distributed file checksum algorithm. It's designed 
> for replicated block.
> {{DFSClient.getFileChecksum()}} need some updates, so it can work for striped 
> block group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
