[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105005#comment-15105005 ]

Kai Zheng commented on HDFS-8430:
---------------------------------

Status update

FileSystem:
* Added a new API {{getFileChecksum(String algorithm)}}, similar to the existing 
{{getFileChecksum}} API, computing the checksum of a file over either all of its 
data or a given range.
* Added a new API {{supportChecksumAlgorithm(String algorithm)}} to query whether 
a checksum algorithm is supported. Both are sketched after this list.
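
For illustration only, roughly what the two additions could look like. The 
interface name and the offset/length parameters below are my placeholders, not 
settled signatures; only the algorithm parameter is from the above.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

// Sketch only: ChecksumAlgorithmSupport and the offset/length parameters
// are hypothetical, pending the actual patch.
public interface ChecksumAlgorithmSupport {

  /** Proposed: file checksum with a named algorithm, for all data or a range. */
  FileChecksum getFileChecksum(Path f, String algorithm,
                               long offset, long length) throws IOException;

  /** Proposed: query whether this file system supports the algorithm. */
  boolean supportChecksumAlgorithm(String algorithm);
}
{code}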

Data transfer protocol:
* Added a new protocol method {{blockGroupChecksum(StripedBlockInfo 
blockGroupInfo, int mode, BlockToken token)}} to calculate the MD5 aggregation 
result for a striped block group on the DataNode side, serving both the old 
and the new APIs (a sketch follows this list).
* Mode 1, for the old API, simply sums all the block checksum data in the group 
one by one, as if the blocks were replicated blocks.
* Mode 2, for the new API, divides and sums all the block checksum data in a 
striping/cell sense.
* In both modes, when data blocks are missing, they are recovered on demand and 
their block checksum data is recomputed; nothing is stored, and the recovered 
data is discarded after use. The recovery logic shares the existing code in 
{{ErasureCodingWorker}} as much as possible via refactoring.
* Added a new protocol method {{rawBlockChecksum()}} to retrieve the whole raw 
block checksum (CRC32) data. For simplicity it retrieves all the data in a 
single pass; multiple passes are to be considered. This serves the new API, 
because a block group checksum computer needs to gather all the block checksum 
data of the group in one place, so that it can reorganize the data into stripes 
and compute the block group checksum the same way contiguous blocks do.
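
Again for illustration, a sketch of the two protocol additions. The mode 
constants, the {{rawBlockChecksum}} parameters and the placeholder types are 
my assumptions, not final.

{code:java}
import java.io.IOException;

// Placeholder types standing in for the real definitions in the patch.
class StripedBlockInfo {}
class BlockToken {}

interface StripedBlockChecksumProtocol {

  int MODE_OLD_API = 1; // sum block checksums one by one, as if replicated
  int MODE_NEW_API = 2; // divide and sum in a striping/cell sense

  /** MD5 aggregation result for a striped block group, DataNode side. */
  void blockGroupChecksum(StripedBlockInfo blockGroupInfo, int mode,
                          BlockToken token) throws IOException;

  /** Whole raw block checksum (CRC32) data, fetched in a single pass. */
  void rawBlockChecksum(StripedBlockInfo blockInfo, BlockToken token)
      throws IOException;
}
{code}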

On the client side:
* Introduced {{ReplicatedFileChecksumComputer1}}, 
{{ReplicatedFileChecksumComputer2}}, {{StripedFileChecksumComputer1}} and 
{{StripedFileChecksumComputer2}}, sharing code among them as much as possible 
and refactoring the related client-side code (a sketch of the split follows 
this list).
* ReplicatedFileChecksumComputer1 serves the old API and replicated files, 
refactoring and reusing the existing logic.
* ReplicatedFileChecksumComputer2 serves the new API and replicated files, 
similar to ReplicatedFileChecksumComputer1 but cell-aware. The block in 
question must be exactly divisible by the cell size; otherwise, a cell64k-like 
algorithm is rejected with a not-supported exception.
* StripedFileChecksumComputer1 serves the old API, summing all the block group 
checksum data together; for each block group it calls blockGroupChecksum with 
mode 1.
* StripedFileChecksumComputer2 serves the new API, summing all the block group 
checksum data together; for each block group it calls blockGroupChecksum with 
mode 2.
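
To show how the four computers relate, a hypothetical sketch. Only the four 
concrete class names come from the list above; the shared base class, factory 
and the stub bodies are assumptions.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;

// Hypothetical client-side split; stubs stand in for the real computations.
abstract class FileChecksumComputer {
  abstract FileChecksum compute() throws IOException;

  /** Pick a computer: replicated vs. striped file, old vs. new API. */
  static FileChecksumComputer get(boolean striped, boolean newApi) {
    if (striped) {
      return newApi ? new StripedFileChecksumComputer2()   // mode 2
                    : new StripedFileChecksumComputer1();  // mode 1
    }
    return newApi ? new ReplicatedFileChecksumComputer2()  // cell-aware
                  : new ReplicatedFileChecksumComputer1(); // existing logic
  }
}

class ReplicatedFileChecksumComputer1 extends FileChecksumComputer {
  FileChecksum compute() { return null; /* existing replicated logic */ }
}

class ReplicatedFileChecksumComputer2 extends FileChecksumComputer {
  FileChecksum compute() throws IOException {
    // Block length must divide evenly into cells; otherwise a cell64k-like
    // algorithm is rejected with a not-supported exception.
    return null;
  }
}

class StripedFileChecksumComputer1 extends FileChecksumComputer {
  FileChecksum compute() { return null; /* blockGroupChecksum, mode 1 */ }
}

class StripedFileChecksumComputer2 extends FileChecksumComputer {
  FileChecksum compute() { return null; /* blockGroupChecksum, mode 2 */ }
}
{code}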

On the DataNode side:
* Introduced {{BlockChecksumComputer}}, {{BlockGroupChecksumComputer1}} and 
{{BlockGroupChecksumComputer2}}, sharing code among them as much as possible 
and refactoring the related DataNode-side code.
* BlockChecksumComputer serves the old API and replicated blocks, refactoring 
and reusing the existing logic.
* BlockGroupChecksumComputer1 serves the old API, summing all the block 
checksum data in the group together; for each block it calls the existing 
{{blockChecksum()}} method of the data transfer protocol (the MD5 aggregation 
is sketched after this list).
* BlockGroupChecksumComputer2 serves the new API, summing all the stripe 
checksum data in the group together; for each block it calls the new 
{{rawBlockChecksum()}} method of the data transfer protocol.
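
The summing in BlockGroupChecksumComputer1 is essentially the same MD5 
aggregation the replicated path already does; a minimal, self-contained sketch 
(the class and method names here are hypothetical):

{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical core of BlockGroupChecksumComputer1: fold each block's
// checksum bytes (obtained via blockChecksum()) into a single MD5 for the
// whole group.
final class Md5Aggregator {
  static byte[] aggregate(List<byte[]> perBlockChecksums)
      throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (byte[] blockChecksum : perBlockChecksums) {
      md5.update(blockChecksum); // block index order within the group matters
    }
    return md5.digest(); // the group's MD5 aggregation result
  }
}
{code}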

DistCp:
* TODO: use the two newly added APIs to compute and compare checksums for the 
source and target files (a possible shape is sketched below).
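
For the DistCp part, the comparison might end up looking like this, reusing the 
{{ChecksumAlgorithmSupport}} interface sketched earlier; the exact signatures 
are assumptions until the patch settles.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

// Hypothetical DistCp-side check, assuming the proposed methods land as
// getFileChecksum(Path, String, long, long) and
// supportChecksumAlgorithm(String) on the ChecksumAlgorithmSupport sketch.
final class ChecksumCompare {
  static boolean sameChecksum(ChecksumAlgorithmSupport srcFs, Path src,
                              ChecksumAlgorithmSupport dstFs, Path dst,
                              String algorithm) throws IOException {
    if (!srcFs.supportChecksumAlgorithm(algorithm)
        || !dstFs.supportChecksumAlgorithm(algorithm)) {
      return false; // no common algorithm to compare with
    }
    // Length Long.MAX_VALUE stands in for "the whole file" in this sketch.
    FileChecksum a = srcFs.getFileChecksum(src, algorithm, 0, Long.MAX_VALUE);
    FileChecksum b = dstFs.getFileChecksum(dst, algorithm, 0, Long.MAX_VALUE);
    return a != null && a.equals(b);
  }
}
{code}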

The code is still messy and leaves many blanks. I will attach a large patch for 
review once the two APIs work as expected. The work seems hard to break down: 
the function is small, but it becomes big when implemented. I have very 
possibly missed some points, so thanks for comments and suggestions, as always.

> Erasure coding: compute file checksum for stripe files
> ------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed 
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so that it can work for 
> striped block groups.


