[ 
https://issues.apache.org/jira/browse/HADOOP-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754962#comment-16754962
 ] 

Steve Loughran commented on HADOOP-16083:
-----------------------------------------

So what you are saying is that if CRC checking is enabled (i.e. you don't do an 
update with -skipcrccheck), it overwrites all files?

Because with the CRC check disabled, I thought it was simpler than that:
* files where the lengths are different: update
* files where the source is missing: delete (if some other option is enabled)
* files where source.length == dest.length: skip the overwrite
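
A minimal sketch of those three rules, assuming CRC checking is off and only lengths are compared. The class, enum and method names are hypothetical and this is not DistCp's actual code; the "target missing: copy" branch is an assumption that the list above does not spell out.

```java
import java.util.OptionalLong;

/** Hypothetical illustration of the length-only update rules; not DistCp code. */
final class LengthOnlyUpdate {

    enum Action { COPY, SKIP, DELETE_TARGET, NONE }

    /**
     * @param sourceLen     length of the source file, empty if the source is missing
     * @param targetLen     length of the target file, empty if the target is missing
     * @param deleteMissing stands in for "some other option" that enables deletes
     */
    static Action decide(OptionalLong sourceLen, OptionalLong targetLen,
                         boolean deleteMissing) {
        if (!sourceLen.isPresent()) {
            // Source is gone: only touch the target if deletes were requested.
            boolean targetExists = targetLen.isPresent();
            return (targetExists && deleteMissing) ? Action.DELETE_TARGET : Action.NONE;
        }
        if (!targetLen.isPresent()) {
            // Assumption: nothing at the destination yet, so copy.
            return Action.COPY;
        }
        // Both exist: equal lengths are treated as "unchanged", anything else is updated.
        return sourceLen.getAsLong() == targetLen.getAsLong() ? Action.SKIP : Action.COPY;
    }
}
```

For example, `decide(OptionalLong.of(10), OptionalLong.of(10), false)` yields `SKIP`, while different lengths yield `COPY`.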

Now, if the dest is a filesystem without checksums, the update downgrades to 
behaving as if you'd asked for the CRC check to be skipped. This has caused 
problems before: adding CRC checks to S3A (HADOOP-13232) broke all existing 
workflows and tests.

Which is why we have to be very careful about any changes here. All workflows, 
including those invoked internally by Hive, called from Oozie, etc., must keep 
working, with sources and destinations other than just HDFS -> HDFS. 

h3. If I explicitly copy a file from HDFS to S3A, even without -skipcrccheck, I expect the file to be copied, as happens today.

# You'll have to talk to people who use distcp here. At the very least, the 
copy may only be skipped when the source and dest are both using checksums and 
the checksums are equal.
# The new tests will need to go into AbstractContractDistCpTest
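
A rough sketch of the guard described in the first point, under stated assumptions: the `Checksum` class is a hypothetical stand-in for whatever a filesystem returns, `null` models a store that exposes no checksum, and requiring the algorithm names to match is my assumption rather than something stated above. It is not the patch under review.

```java
import java.util.Arrays;

/** Hypothetical illustration of a checksum-aware skip; not the HADOOP-16083 patch. */
final class ChecksumAwareSkip {

    /** Stand-in for a filesystem checksum: the algorithm name plus the raw bytes. */
    static final class Checksum {
        final String algorithm;
        final byte[] bytes;

        Checksum(String algorithm, byte[] bytes) {
            this.algorithm = algorithm;
            this.bytes = bytes.clone();
        }
    }

    /**
     * Returns true only when it is safe to skip the copy: lengths match, both
     * sides expose a checksum, and the checksums are comparable and equal.
     * A null checksum means the filesystem has none, so the file is copied.
     */
    static boolean canSkipCopy(long sourceLen, long targetLen,
                               Checksum source, Checksum target) {
        if (sourceLen != targetLen) {
            return false;               // different lengths: always copy
        }
        if (source == null || target == null) {
            return false;               // e.g. dest without checksums: copy as today
        }
        // Assumption: checksums from different algorithms are not comparable.
        return source.algorithm.equals(target.algorithm)
            && Arrays.equals(source.bytes, target.bytes);
    }
}
```

With this shape, an HDFS source and a checksum-less destination fall through to a copy, which keeps the behaviour the heading above asks for.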



> DistCp shouldn't always overwrite the target file when checksums match
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-16083
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16083
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>         Attachments: HADOOP-16083.001.patch
>
>
> {code:java|title=CopyMapper#setup}
> ...
>     try {
>       overWrite = overWrite
>           || targetFS.getFileStatus(targetFinalPath).isFile();
>     } catch (FileNotFoundException ignored) {
>     }
> ...
> {code}
> The above code overrides the config key "overWrite" to "true" whenever the 
> target path is a file. As a result, an unnecessary transfer happens even when 
> the source and target files have the same checksum.
> My suggestion is to remove the code above. If the user insists on 
> overwriting, they can just add -overwrite to the options:
> {code:bash|title=DistCp command with -overwrite option}
> hadoop distcp -overwrite \
>   hdfs://localhost:64464/source/5/6.txt \
>   hdfs://localhost:64464/target/5/6.txt
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
