[
https://issues.apache.org/jira/browse/HADOOP-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438921#comment-16438921
]
Rohit Pegallapati edited comment on HADOOP-13023 at 4/16/18 2:35 AM:
---------------------------------------------------------------------
This looks in line with the intended behavior of the -update option:
[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
{code:java}
-update is used to copy files from source that don’t exist at the target or
differ from the target version. -overwrite overwrites target-files that exist
at the target.
The Update and Overwrite options warrant special attention since their handling
of source-paths varies from the defaults in a very subtle manner. Consider a
copy from /source/first/ and /source/second/ to /target/, where the source
paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults would
create directories first/ and second/, under /target. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second
hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the *contents* of the
source-directories are copied to target, and not the source directories
themselves.
Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second
hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
{code}
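To make the quoted rule concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that models how source files map to target paths in the two modes. The class and method names are hypothetical, and the model assumes the target directory already exists and that every file sits under one of the listed source directories; it illustrates the documented layout only and is not DistCp's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: models the target layout described in the DistCp guide.
final class DistCpTargetLayout {

    /**
     * Returns the target path for each source file.
     * Default mode keeps the source directory name under the target;
     * -update/-overwrite mode copies only the directory contents.
     */
    static List<String> targetPaths(List<String> sourceDirs, List<String> sourceFiles,
                                    String targetDir, boolean updateOrOverwrite) {
        List<String> result = new ArrayList<>();
        for (String file : sourceFiles) {
            for (String dir : sourceDirs) {
                String prefix = dir + "/";
                if (!file.startsWith(prefix)) {
                    continue; // file belongs to a different source directory
                }
                String relative = file.substring(prefix.length());
                if (updateOrOverwrite) {
                    // Contents land directly under the target.
                    result.add(targetDir + "/" + relative);
                } else {
                    // The source directory itself is recreated under the target.
                    String dirName = dir.substring(dir.lastIndexOf('/') + 1);
                    result.add(targetDir + "/" + dirName + "/" + relative);
                }
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("/source/first", "/source/second");
        List<String> files = Arrays.asList(
                "/source/first/1", "/source/first/2",
                "/source/second/10", "/source/second/20");
        System.out.println("default: " + targetPaths(dirs, files, "/target", false));
        System.out.println("-update: " + targetPaths(dirs, files, "/target", true));
    }
}
```

With the four files from the documentation example, the default mode keeps first/ and second/ under /target, while the -update mode flattens them into /target.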
I ran a small test with an encryption zone to validate the point above:
{code:java}
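// NOTE: `source` and `target` are not declared in this snippet; presumably they are
// Path fields of the enclosing test class.
Path sourcePath = new Path(dfs.getWorkingDirectory(), "source");
initData10(sourcePath);
// Create the encryption zone on a subdirectory of the source.
Path foo = new Path("/source/foo");
dfs.mkdirs(foo);
dfs.createEncryptionZone(foo, "test");
// Copy through /.reserved/raw with -update.
String[] args = new String[] {
    "-update",
    "/.reserved/raw" + source.toString(),
    "/.reserved/raw" + target.toString()
};
new DistCp(conf, OptionsParser.parse(args)).execute();
// Print every encryption zone that exists after the copy.
RemoteIterator<EncryptionZone> listEncryptionZones = dfs.listEncryptionZones();
while (listEncryptionZones.hasNext()) {
  System.out.println("Encryption Zone :: " + listEncryptionZones.next().getPath());
}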
{code}
The code above prints two encryption zones, because the encryption zone is
created on "foo", a subdirectory of the source directory. Here we can observe
that the encryption zone of the subdirectory is preserved at the target:
{code:java}
Encryption Zone :: /source/foo
Encryption Zone :: /target/foo
{code}
On the other hand, the code below prints only one encryption zone, because the
encryption zone is created directly on the source directory rather than on a
subdirectory:
{code:java}
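// NOTE: `source` and `target` are not declared in this snippet; presumably they are
// Path fields of the enclosing test class.
Path sourcePath = new Path(dfs.getWorkingDirectory(), "source");
initData10(sourcePath);
// Create the encryption zone directly on the source directory.
dfs.createEncryptionZone(source, "test");
// Copy through /.reserved/raw with -update.
String[] args = new String[] {
    "-update",
    "/.reserved/raw" + source.toString(),
    "/.reserved/raw" + target.toString()
};
new DistCp(conf, OptionsParser.parse(args)).execute();
// Print every encryption zone that exists after the copy.
RemoteIterator<EncryptionZone> listEncryptionZones = dfs.listEncryptionZones();
while (listEncryptionZones.hasNext()) {
  System.out.println("Encryption Zone :: " + listEncryptionZones.next().getPath());
}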
{code}
{code:java}
Encryption Zone :: /source
{code}
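The two observations can be summarized in a small plain-Java model. This is only a sketch of the behavior seen in the tests above, under the assumption that with -update the source root directory itself is not copied while everything beneath it is. The class and method names are hypothetical, and nothing here calls into Hadoop or reflects how DistCp handles raw xattrs internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Hadoop code: which source zone roots reappear under the target.
final class UpdateZoneModel {

    /**
     * Given the encryption zone roots on the source side, returns the zone roots
     * expected at the target after an -update copy of `source` into `target`.
     * A zone rooted exactly at `source` is skipped, because in this model only
     * the contents of the source directory are copied.
     */
    static List<String> zonesPreservedByUpdate(String source, String target,
                                               List<String> sourceZones) {
        List<String> preserved = new ArrayList<>();
        String prefix = source + "/";
        for (String zone : sourceZones) {
            if (zone.startsWith(prefix)) {
                // Zone on a subdirectory: the directory is copied, so the zone follows it.
                preserved.add(target + zone.substring(source.length()));
            }
            // zone.equals(source): the source root is not copied, so nothing is added.
        }
        return preserved;
    }

    public static void main(String[] args) {
        System.out.println(zonesPreservedByUpdate(
                "/source", "/target", Arrays.asList("/source/foo")));
        System.out.println(zonesPreservedByUpdate(
                "/source", "/target", Arrays.asList("/source")));
    }
}
```

In this model the first call yields /target/foo and the second yields an empty list, matching the two test outputs above.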
> Distcp with -update feature on first time raw data not working
> --------------------------------------------------------------
>
> Key: HADOOP-13023
> URL: https://issues.apache.org/jira/browse/HADOOP-13023
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools/distcp
> Affects Versions: 2.6.0
> Reporter: Mavin Martin
> Priority: Major
>
> When attempting to do a distcp with the -update feature toggled on encrypted
> data, the distcp shows as successful. Reading the encrypted file on the
> target_path does not work since the keyName does not exist.
> Please see my example to reproduce the issue.
> {code}
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hdfs dfs -ls -R /tmp
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:22 /tmp/a
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:00 /tmp/a/ted
> -rw-r--r-- 3 xxx xxx 33 2016-04-14 00:00 /tmp/a/ted/test.txt
> [root@xxx bin]# hadoop distcp -update /.reserved/raw/tmp/a/ted
> /.reserved/raw/tmp/a-with-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hadoop distcp /.reserved/raw/tmp/a/ted
> /.reserved/raw/tmp/a-no-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> /tmp/a-no-update/ted DEF0000000000013
> {code}
> The crypto zone for 'a-with-update' should have been created since this is a
> new destination. You can verify this by looking at 'a-no-update'.