[
https://issues.apache.org/jira/browse/HADOOP-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783384#comment-16783384
]
Kai Xie edited comment on HADOOP-16158 at 3/4/19 1:56 PM:
----------------------------------------------------------
Hi Steve, thanks for the comment.
`isSplit` here was introduced by HADOOP-11794
([commit|https://github.com/apache/hadoop/commit/064c8b25eca9bc825dc07a54d9147d65c9290a03#diff-a3629647166ce008e67f0a93bc9c856bR265])
and is used to indicate whether the source data being copied is only a chunk of the file
(one or more of its blocks, but not all of them).
That patch skipped the checksum validation in the DistCp CopyMapper /
RetriableFileCopyCommand because
# the copied target data is only a chunk (a few blocks) of the source file, and
# the existing FileSystem API `getFileChecksum` cannot operate at the block level (see the sketch below).
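For reference, here is a rough sketch of what whole-file checksum validation looks like with the existing API (illustrative names only, not DistCp's actual code); since `getFileChecksum` takes only a path, there is no way to restrict it to a block range:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only, not DistCp's actual code.
public class WholeFileChecksumCheck {
  // Compares whole-file checksums with the existing API. getFileChecksum(Path)
  // takes no offset/length, so it cannot validate a single chunk of a split copy.
  public static void compare(FileSystem srcFS, Path src,
      FileSystem dstFS, Path dst) throws IOException {
    FileChecksum srcChecksum = srcFS.getFileChecksum(src);  // may be null
    FileChecksum dstChecksum = dstFS.getFileChecksum(dst);  // may be null
    if (srcChecksum != null && dstChecksum != null
        && !srcChecksum.equals(dstChecksum)) {
      throw new IOException("Checksum mismatch between " + src + " and " + dst);
    }
  }
}
{code}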
I see two options for the patch:
# Add the checksum validation in the DistCp CopyCommitter after the chunks are merged
back into one file
([code|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java#L628]).
The downside is that this loses the chance to retry the copy in the map phase if a
checksum mismatch is detected.
# Add an API to FileSystem such as `getFileChecksum(path, start, length)` and use it in
the DistCp CopyMapper to validate the checksum of the source range against the copied
blocks (see the sketch after this list). But I'm not sure whether this use case is strong
enough to justify adding the new API.
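To make option 2 concrete, here is a rough sketch of the hypothetical range-based API and how a mapper-side validation could use it. The signature `getFileChecksum(Path, long, long)` and the helper below do not exist in Hadoop today; they are assumptions for illustration only:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch, not an existing Hadoop API:
//   FileChecksum FileSystem#getFileChecksum(Path f, long start, long length)
public class ChunkChecksumCheck {
  // Rough idea of a mapper-side validation for one chunk; all names illustrative.
  public static void validateChunk(FileSystem srcFS, Path src,
      long chunkOffset, long chunkLength,
      FileSystem dstFS, Path dstChunk) throws IOException {
    // Would need the new range-based API on the source side.
    FileChecksum srcChecksum = srcFS.getFileChecksum(src, chunkOffset, chunkLength);
    // The copied chunk is a complete file on the target, so the existing API works there.
    FileChecksum dstChecksum = dstFS.getFileChecksum(dstChunk);
    if (srcChecksum != null && dstChecksum != null
        && !srcChecksum.equals(dstChecksum)) {
      throw new IOException("Checksum mismatch for chunk of " + src
          + " at offset " + chunkOffset);
    }
  }
}
{code}
Whether a range checksum and a standalone-file checksum would actually be comparable depends on how the new API computes them, which is part of why I'm unsure the use case justifies it.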
> DistCp supports checksum validation when copying blocks in parallel
> -------------------------------------------------------------------
>
> Key: HADOOP-16158
> URL: https://issues.apache.org/jira/browse/HADOOP-16158
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 3.2.0, 2.9.2
> Reporter: Kai Xie
> Assignee: Kai Xie
> Priority: Major
>
> Copying blocks in parallel (enabled when blocks per chunk > 0) is a great
> DistCp improvement that can greatly speed up the copying of big files.
> But its checksum validation is skipped, e.g. in
> `RetriableFileCopyCommand.java`
>
> {code:java}
> if (!source.isSplit()) {
>   compareCheckSums(sourceFS, source.getPath(), sourceChecksum,
>       targetFS, targetPath);
> }
> {code}
> and this could result in a checksum/data mismatch that goes unnoticed by
> developers/users (e.g. HADOOP-16049).
> I'd like to provide a patch to add the checksum validation.
>