[ https://issues.apache.org/jira/browse/HADOOP-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438921#comment-16438921 ]
Rohit Pegallapati edited comment on HADOOP-13023 at 4/16/18 2:18 AM:
---------------------------------------------------------------------

This looks in line with the intended behavior of the -update option
[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]

{code:java}
-update is used to copy files from source that don’t exist at the target or differ from the target version.
-overwrite overwrites target-files that exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:

hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20

When DistCp is invoked without -update or -overwrite, the DistCp defaults would create directories first/ and second/, under /target. Thus:

distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20

When either -update or -overwrite is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves. Thus:

distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
{code}

was (Author: rohit.peg):
This looks inline with the intended behavior of -update option
[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]

{code}
{{-update}} is used to copy files from source that don’t exist at the target or differ from the target version.
{{-overwrite}} overwrites target-files that exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from {{/source/first/}} and {{/source/second/}} to {{/target/}}, where the source paths have the following contents:

hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20

When DistCp is invoked without {{-update}} or {{-overwrite}}, the DistCp defaults would create directories {{first/}} and {{second/}}, under {{/target}}. Thus:

distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in {{/target}}:

hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20

When either {{-update}} or {{-overwrite}} is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves. Thus:

distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
{code}

> Distcp with -update feature on first time raw data not working
> --------------------------------------------------------------
>
>                 Key: HADOOP-13023
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13023
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 2.6.0
>            Reporter: Mavin Martin
>            Priority: Major
>
> When attempting to do a distcp with the -update feature toggled on encrypted
> data, the distcp shows as successful. Reading the encrypted file on the
> target_path does not work since the keyName does not exist.
> Please see my example to reproduce the issue.
> {code}
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hdfs dfs -ls -R /tmp
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:22 /tmp/a
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:00 /tmp/a/ted
> -rw-r--r-- 3 xxx xxx 33 2016-04-14 00:00 /tmp/a/ted/test.txt
> [root@xxx bin]# hadoop distcp -update /.reserved/raw/tmp/a/ted
> /.reserved/raw/tmp/a-with-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hadoop distcp /.reserved/raw/tmp/a/ted
> /.reserved/raw/tmp/a-no-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> /tmp/a-no-update/ted DEF0000000000013
> {code}
> The crypto zone for 'a-with-update' should have been created since this is a
> new destination. You can verify this by looking at 'a-no-update'.
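
For illustration only, the path mapping described in the DistCp documentation quoted in the comment can be modeled with a short Python sketch. The function name distcp_targets and its parameters are hypothetical; this is not DistCp's implementation. It covers only the multi-source example from the documentation and says nothing about encryption zones or /.reserved/raw handling.

```python
import posixpath


def distcp_targets(source_files, source_dirs, target, update=False):
    """Map each source file to the target path the documentation describes.

    Hypothetical model of the documented behavior only:
    - default: each source directory is recreated under the target;
    - update=True (standing in for -update or -overwrite): only the
      contents of each source directory are placed under the target.
    """
    mapped = []
    for path in source_files:
        for src in source_dirs:
            src = src.rstrip("/")
            prefix = src + "/"
            if path.startswith(prefix):
                relative = path[len(prefix):]
                if not update:
                    # Keep the source directory's own name (e.g. "first").
                    relative = posixpath.join(posixpath.basename(src), relative)
                mapped.append(posixpath.join(target, relative))
                break
    return mapped


files = [
    "hdfs://nn1:8020/source/first/1",
    "hdfs://nn1:8020/source/first/2",
    "hdfs://nn1:8020/source/second/10",
    "hdfs://nn1:8020/source/second/20",
]
dirs = ["hdfs://nn1:8020/source/first", "hdfs://nn1:8020/source/second"]

for update in (False, True):
    print("update =", update)
    for mapped_path in distcp_targets(files, dirs, "hdfs://nn2:8020/target", update):
        print("  " + mapped_path)
```

With update=False the sketch keeps the first/ and second/ directory names under the target; with update=True it flattens their contents directly into the target, matching the two listings in the quoted documentation.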