[
https://issues.apache.org/jira/browse/HADOOP-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783384#comment-16783384
]
Kai Xie edited comment on HADOOP-16158 at 3/4/19 1:56 PM:
----------------------------------------------------------
Hi Steve, thanks for the comment.
`isSplit` here was introduced by HADOOP-11794
([commit|https://github.com/apache/hadoop/commit/064c8b25eca9bc825dc07a54d9147d65c9290a03#diff-a3629647166ce008e67f0a93bc9c856bR265])
and is used to indicate whether the source data being copied is only a chunk of the file
(one or more of its blocks, but not all of them).
That patch skipped the checksum validation in the DistCp CopyMapper /
RetriableFileCopyCommand because
# the copied target data is only a chunk (a few blocks) of the source file, and
# the existing FileSystem API `getFileChecksum` cannot operate at the block level (see the sketch below).
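For reference, here is a rough sketch of what whole-file checksum validation looks like with the existing API (illustrative names only, not DistCp's actual code); since `getFileChecksum` takes only a path, there is no way to restrict it to a block range:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only, not DistCp's actual code.
public class WholeFileChecksumCheck {
  // Compares whole-file checksums with the existing API. getFileChecksum(Path)
  // takes no offset/length, so it cannot validate a single chunk of a split copy.
  public static void compare(FileSystem srcFS, Path src,
      FileSystem dstFS, Path dst) throws IOException {
    FileChecksum srcChecksum = srcFS.getFileChecksum(src);  // may be null
    FileChecksum dstChecksum = dstFS.getFileChecksum(dst);  // may be null
    if (srcChecksum != null && dstChecksum != null
        && !srcChecksum.equals(dstChecksum)) {
      throw new IOException("Checksum mismatch between " + src + " and " + dst);
    }
  }
}
{code}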
I see two options for the patch:
# Add the checksum validation in the DistCp CopyCommitter after the chunks are merged
back into one file
([code|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java#L628]).
The downside is that this loses the chance to retry the copy in the map phase if a
checksum mismatch is detected.
# Add an API to FileSystem such as `getFileChecksum(path, start, length)` and use it in
the DistCp CopyMapper to validate the checksum of the source range against the copied
blocks (see the sketch after this list). But I'm not sure whether this use case is strong
enough to justify adding the new API.
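To make option 2 concrete, here is a rough sketch of the hypothetical range-based API and how a mapper-side validation could use it. The signature `getFileChecksum(Path, long, long)` and the helper below do not exist in Hadoop today; they are assumptions for illustration only:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch, not an existing Hadoop API:
//   FileChecksum FileSystem#getFileChecksum(Path f, long start, long length)
public class ChunkChecksumCheck {
  // Rough idea of a mapper-side validation for one chunk; all names illustrative.
  public static void validateChunk(FileSystem srcFS, Path src,
      long chunkOffset, long chunkLength,
      FileSystem dstFS, Path dstChunk) throws IOException {
    // Would need the new range-based API on the source side.
    FileChecksum srcChecksum = srcFS.getFileChecksum(src, chunkOffset, chunkLength);
    // The copied chunk is a complete file on the target, so the existing API works there.
    FileChecksum dstChecksum = dstFS.getFileChecksum(dstChunk);
    if (srcChecksum != null && dstChecksum != null
        && !srcChecksum.equals(dstChecksum)) {
      throw new IOException("Checksum mismatch for chunk of " + src
          + " at offset " + chunkOffset);
    }
  }
}
{code}
Whether a range checksum and a standalone-file checksum would actually be comparable depends on how the new API computes them, which is part of why I'm unsure the use case justifies it.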
> DistCp supports checksum validation when copying blocks in parallel
> -------------------------------------------------------------------
>
> Key: HADOOP-16158
> URL: https://issues.apache.org/jira/browse/HADOOP-16158
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 3.2.0, 2.9.2
> Reporter: Kai Xie
> Assignee: Kai Xie
> Priority: Major
>
> Copying blocks in parallel (enabled when blocks per chunk > 0) is a great
> DistCp improvement that can greatly speed up the copying of big files.
> But its checksum validation is skipped, e.g. in
> `RetriableFileCopyCommand.java`
>
> {code:java}
> if (!source.isSplit()) {
>   compareCheckSums(sourceFS, source.getPath(), sourceChecksum,
>       targetFS, targetPath);
> }
> {code}
> and this could result in a checksum/data mismatch that goes unnoticed by
> developers/users (e.g. HADOOP-16049).
> I'd like to provide a patch to add the checksum validation.
>