[
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105005#comment-15105005
]
Kai Zheng commented on HDFS-8430:
---------------------------------
Status update
FileSystem:
* Added a new API {{getFileChecksum(String algorithm)}}, similar to the
existing {{getFileChecksum}} API, to compute the checksum of a file over all of
its data or over a range.
* Added a new API {{supportChecksumAlgorithm(String algorithm)}}.
Data transfer protocol:
* Added a new protocol method {{blockGroupChecksum(StripedBlockInfo
blockGroupInfo, int mode, BlockToken token)}} to calculate the aggregated MD5
result for a striped block group on the DataNode side, for both the old and new
APIs.
** Mode 1, for the old API, simply sums the block checksum data in the group
block by block, as if they were replicated blocks.
** Mode 2, for the new API, divides and sums the block checksum data in
striping/cell order.
** In both modes, if data blocks are missing, the blocks are recovered on
demand and their block checksum data is recomputed. The recovered data is not
stored; it is discarded after use. The recovery logic shares the existing code
in {{ErasureCodingWorker}} as far as possible, via refactoring.
* Added a new protocol method {{rawBlockChecksum()}} to retrieve the whole raw
block checksum (CRC32) data. For simplicity it currently gets all the data in
one pass; multiple passes are to be considered. This is for the new API, because
a block group checksum computer needs to gather all the block checksum data in
the group in one place so that it can reorganize the data into stripes and
compute the block group checksum the same way as for contiguous blocks.
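To make the difference between the two modes concrete, here is a minimal sketch, not the code in the patch. It assumes each block's raw CRC data is already available as a byte array and that every cell maps to a fixed number of CRC bytes. The class name {{BlockGroupChecksumSketch}}, the method names and the parameter {{crcBytesPerCell}} are hypothetical. It digests the raw CRC bytes directly, which may differ from how the real implementation nests MD5 results.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical illustration of the two aggregation orders; not the patch code. */
final class BlockGroupChecksumSketch {

  /** Plain MD5 of a byte array, used to compare against the two modes. */
  static byte[] md5(byte[] data) {
    MessageDigest md = newMd5();
    md.update(data);
    return md.digest();
  }

  /** Mode 1 (assumed): digest each block's CRC data one block after another. */
  static byte[] mode1(List<byte[]> blockCrcs) {
    MessageDigest md = newMd5();
    for (byte[] crcs : blockCrcs) {
      md.update(crcs);
    }
    return md.digest();
  }

  /**
   * Mode 2 (assumed): walk the group stripe by stripe. Within each stripe,
   * take one cell's worth of CRC bytes from every block in turn, so the CRCs
   * are digested in the logical (contiguous) order of the file data.
   */
  static byte[] mode2(List<byte[]> blockCrcs, int crcBytesPerCell) {
    if (crcBytesPerCell <= 0) {
      throw new IllegalArgumentException("crcBytesPerCell must be positive");
    }
    MessageDigest md = newMd5();
    boolean more = true;
    for (int offset = 0; more; offset += crcBytesPerCell) {
      more = false;
      for (byte[] crcs : blockCrcs) {
        if (offset < crcs.length) {
          // The last cell of a block may be shorter than a full cell.
          int len = Math.min(crcBytesPerCell, crcs.length - offset);
          md.update(crcs, offset, len);
          more = true;
        }
      }
    }
    return md.digest();
  }

  private static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

With two blocks holding CRC bytes {{1,2,3,4}} and {{5,6,7,8}} and two CRC bytes per cell, {{mode1}} digests {{1..8}} in block order, while {{mode2}} digests {{1,2,5,6,3,4,7,8}}. This reordering is why the raw checksum data of every block has to be gathered in one place first.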
On the client side:
* Introduced {{ReplicatedFileChecksumComputer1}},
{{ReplicatedFileChecksumComputer2}}, {{StripedFileChecksumComputer1}} and
{{StripedFileChecksumComputer2}}, which share code as far as possible, and
refactored the related client-side code.
** {{ReplicatedFileChecksumComputer1}} is for the old API and replicated files;
it refactors and reuses the existing logic.
** {{ReplicatedFileChecksumComputer2}} is for the new API and replicated files.
It is similar to {{ReplicatedFileChecksumComputer1}} but is aware of cells. The
block in question must be exactly divisible by the cell size; otherwise an
exception is thrown saying that a cell64k-like algorithm is not supported.
** {{StripedFileChecksumComputer1}} is for the old API. It calls
{{blockGroupChecksum}} with mode 1 for each block group and sums all the block
group checksum data together.
** {{StripedFileChecksumComputer2}} is for the new API. It calls
{{blockGroupChecksum}} with mode 2 for each block group and sums all the block
group checksum data together.
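The choice among the four client-side computers, and the cell alignment check for replicated files under the new API, can be sketched roughly as follows. This is only an illustration: the selector class, its method names and the choice of {{UnsupportedOperationException}} are assumptions, and the computers are represented by their names rather than real classes.

```java
/** Hypothetical sketch of how a client might pick a checksum computer. */
final class FileChecksumComputerSelector {

  /**
   * Returns the name of the computer described above for the given
   * file layout (striped or replicated) and API (new or old).
   */
  static String select(boolean stripedFile, boolean newApi) {
    if (stripedFile) {
      return newApi ? "StripedFileChecksumComputer2"
                    : "StripedFileChecksumComputer1";
    }
    return newApi ? "ReplicatedFileChecksumComputer2"
                  : "ReplicatedFileChecksumComputer1";
  }

  /**
   * Assumed form of the cell check for replicated files under the new API:
   * the block length must be a whole number of cells.
   */
  static void requireCellAligned(long blockLength, int cellSize) {
    if (cellSize <= 0) {
      throw new IllegalArgumentException("cellSize must be positive");
    }
    if (blockLength % cellSize != 0) {
      throw new UnsupportedOperationException(
          "block length " + blockLength
              + " is not a multiple of cell size " + cellSize);
    }
  }
}
```

For example, a 128 MB block passes the check with a 64 KB cell, while a block whose length is not a multiple of 64 KB is rejected.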
On the DataNode side:
* Introduced {{BlockChecksumComputer}}, {{BlockGroupChecksumComputer1}} and
{{BlockGroupChecksumComputer2}}, which share code as far as possible, and
refactored the related DataNode-side code.
** {{BlockChecksumComputer}} is for the old API and replicated blocks; it
refactors and reuses the existing logic.
** {{BlockGroupChecksumComputer1}} is for the old API. It calls the existing
{{blockChecksum()}} method in the data transfer protocol for each block and
sums all the block checksum data in the group together.
** {{BlockGroupChecksumComputer2}} is for the new API. It calls the new
{{rawBlockChecksum()}} method in the data transfer protocol for each block and
sums all the stripe checksum data in the group together.
DistCp:
* TODO: will use the two new APIs to compute and compare checksums for the
source and target files.
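A rough sketch of how a DistCp-style comparison might use the two new APIs is shown below. It is not part of the patch. The {{ChecksumSource}} interface is a hypothetical per-file stand-in for the two {{FileSystem}} methods, {{byte[]}} stands in for the real checksum type, and the three-way result is an assumption about how an unsupported algorithm might be reported.

```java
import java.util.Arrays;

/** Hypothetical per-file view of the two new FileSystem APIs. */
interface ChecksumSource {
  boolean supportChecksumAlgorithm(String algorithm);

  byte[] getFileChecksum(String algorithm);
}

/** Hypothetical comparison helper in the spirit of the DistCp TODO. */
final class ChecksumCompareSketch {

  enum Result { MATCH, MISMATCH, UNSUPPORTED }

  /** Compares checksums only when both sides support the algorithm. */
  static Result compare(ChecksumSource source, ChecksumSource target,
                        String algorithm) {
    if (!source.supportChecksumAlgorithm(algorithm)
        || !target.supportChecksumAlgorithm(algorithm)) {
      return Result.UNSUPPORTED;
    }
    byte[] a = source.getFileChecksum(algorithm);
    byte[] b = target.getFileChecksum(algorithm);
    return Arrays.equals(a, b) ? Result.MATCH : Result.MISMATCH;
  }

  /** In-memory stand-in that supports one algorithm with a fixed checksum. */
  static ChecksumSource fixed(final String supported, final byte[] checksum) {
    return new ChecksumSource() {
      @Override
      public boolean supportChecksumAlgorithm(String algorithm) {
        return supported.equals(algorithm);
      }

      @Override
      public byte[] getFileChecksum(String algorithm) {
        return checksum.clone();
      }
    };
  }
}
```

Checking {{supportChecksumAlgorithm}} first lets the caller tell "checksums differ" apart from "this pair of file systems cannot be compared with this algorithm".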
The code is still messy and leaves many blanks. I will attach a large patch for
review once the two APIs work as expected. It seems the work will need to be
broken down: the function is small, but it gets big when implemented. I have
very possibly missed some points; thanks for comments and suggestions, as always.
> Erasure coding: compute file checksum for stripe files
> ------------------------------------------------------
>
> Key: HDFS-8430
> URL: https://issues.apache.org/jira/browse/HDFS-8430
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7285
> Reporter: Walter Su
> Assignee: Kai Zheng
> Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped
> block groups.