[
https://issues.apache.org/jira/browse/HDFS-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongjun Zhang updated HDFS-6152:
--------------------------------
Status: Patch Available (was: Open)
The submitted patch tries to address the two issues reported.
Some notable changes:
1. A new boolean field "targetPathExists" is introduced to the DistCpOptions
class. The value of this field is derived by checking whether the target path
exists at the beginning of distcp. (Arguably, this information could be
put somewhere else, but I found DistCpOptions to be the most suitable place based
on the current DistCp implementation.)
A new corresponding jobconf property CONF_LABEL_TARGET_PATH_EXISTS is
introduced, and it's initialized at the same time as the targetPathExists field.
The reason is that the result of class SimpleCopyListing's method
computeSourceRootPath depends on DistCpOptions, e.g., on whether the distcp target
exists and on whether the -update or -overwrite switch is passed. Item 3
below also needs this info (via the new jobconf property).
Unit tests that use DistCpOptions need to set this
field according to the test case's setup.
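As a rough illustration of item 1 (not the code in the patch), the flag could be derived once up front and mirrored into the job configuration. In this sketch a plain Map stands in for the jobconf, java.nio stands in for the Hadoop FileSystem API, and the key string and the method name recordTargetPathExists are hypothetical:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

final class TargetPathExistsSketch {
    // Hypothetical key string; stands in for CONF_LABEL_TARGET_PATH_EXISTS.
    static final String TARGET_PATH_EXISTS_KEY = "sketch.target.path.exists";

    // Checks the target once, before any copying starts, and records the
    // answer in the config map so that later stages can read the same value.
    static boolean recordTargetPathExists(Path target, Map<String, String> conf) {
        boolean exists = Files.exists(target);
        conf.put(TARGET_PATH_EXISTS_KEY, Boolean.toString(exists));
        return exists;
    }
}
```

The point of recording the value at the start is that the copy itself may create the target, so a check made later would no longer reflect the state the listing was built against.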
2. For the issues reported in this JIRA, the entry that was skipped by
the writeToFileListing method with the following code:
{code}
if (fileStatus.getPath().equals(sourcePathRoot) && fileStatus.isDirectory())
return; // Skip the root-paths.
{code}
is now added to the file listing when neither -update nor -overwrite is specified.
This entry is recognized by both the CopyMapper and the CopyCommitter.
Using this entry, the CopyMapper creates the directory accordingly (for ISSUE 2), and
the CopyCommitter updates the attributes when -p is specified (for ISSUE 1).
E.g., distcp a/b xyz, where a/b is the source dir,
a. if xyz doesn't exist, then "a/b" is written to the copyListing with empty
relative path "".
b. if xyz exists, then "a/b" is written to the copyListing with relative path
"b".
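The rule in item 2 can be sketched as a small pure function. This is only an illustration of the two cases above, not the patch itself; the class name, method name, and the use of plain strings instead of Hadoop Path objects are assumptions:

```java
import java.util.Optional;

final class RootEntrySketch {
    // Returns the relative path under which the source root dir would be
    // written to the copy listing, or empty if the root entry is skipped.
    static Optional<String> rootRelativePath(String sourceDir,
                                             boolean targetPathExists,
                                             boolean updateOrOverwrite) {
        if (updateOrOverwrite) {
            return Optional.empty();      // root entry is still skipped here
        }
        if (!targetPathExists) {
            return Optional.of("");       // case a: source root maps onto the target itself
        }
        // case b: source root lands under the target, named by its last component
        String trimmed = sourceDir;
        while (trimmed.length() > 1 && trimmed.endsWith("/")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        return Optional.of(trimmed.substring(trimmed.lastIndexOf('/') + 1));
    }
}
```

For "distcp a/b xyz", rootRelativePath("a/b", false, false) yields "" and rootRelativePath("a/b", true, false) yields "b", matching cases a and b.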
3. Class CopyCommitter's method deleteMissing creates a DistCpOptions object with
default settings and collects a listing from the result of distcp prior to
committing. This is not sufficient for the reason mentioned above (the result of class
SimpleCopyListing's method computeSourceRootPath depends on DistCpOptions). The
problem was revealed by the change I added to fix this JIRA, and the patch I
submitted addresses it.
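To illustrate item 3, the commit-time options could be rebuilt from the value stored in the job configuration rather than left at defaults. This is a sketch under assumptions: the Options class, the fromConf helper, and the key string are all hypothetical stand-ins, and a Map again stands in for the jobconf:

```java
import java.util.Map;

final class CommitOptionsSketch {
    // Hypothetical key string; stands in for CONF_LABEL_TARGET_PATH_EXISTS.
    static final String TARGET_PATH_EXISTS_KEY = "sketch.target.path.exists";

    // Hypothetical, stripped-down stand-in for DistCpOptions.
    static final class Options {
        boolean targetPathExists;
    }

    // Restores the flag recorded at job start, so a listing built at commit
    // time computes the same source root as the original listing did.
    // Note: a missing key parses as false in this sketch.
    static Options fromConf(Map<String, String> conf) {
        Options options = new Options();
        options.targetPathExists =
            Boolean.parseBoolean(conf.get(TARGET_PATH_EXISTS_KEY));
        return options;
    }
}
```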
Thanks for reviewing.
> distcp V2 doesn't preserve root dir's attributes when -p is specified
> ---------------------------------------------------------------------
>
> Key: HDFS-6152
> URL: https://issues.apache.org/jira/browse/HDFS-6152
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.3.0
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-6152.001.patch
>
>
> Two issues were observed with distcpV2
> ISSUE 1. when copying a source dir to target dir with "-pu" option using
> command
> "distcp -pu source-dir target-dir"
>
> The source dir's owner is not preserved at target dir. Similarly, other
> attributes of source dir are not preserved. Supposedly they should be
> preserved when neither -update nor -overwrite is specified.
> There are two scenarios with the above command:
> a. when target-dir already exists. Issuing the above command will result in
> target-dir/source-dir (source-dir here refers to the last component of the
> source-dir path in the command line) at target file system, with all contents
> in source-dir copied to under target-dir/source-dir. The issue in this case is
> that the attributes of source-dir are not preserved.
> b. when target-dir doesn't exist. It will result in target-dir with all
> contents of source-dir copied to under target-dir. The issue in this case
> is that the attributes of source-dir are not carried over to target-dir.
> For multiple source cases, e.g., command
> "distcp -pu source-dir1 source-dir2 target-dir"
> No matter whether the target-dir exists or not, the multiple sources are
> copied to under the target dir (target-dir is created if it didn't exist).
> And their attributes are preserved.
> ISSUE 2. with the following command:
> "distcp source-dir target-dir"
> when source-dir is an empty directory, and when target-dir doesn't exist,
> source-dir is not copied; the command actually behaves like a no-op. However,
> when source-dir is not empty, it is copied, resulting in
> target-dir at the target file system containing a copy of source-dir's
> children.
> To be consistent, an empty source dir should be copied too. Basically, the above
> distcp command should cause target-dir to be created at the target file system, and
> source-dir's attributes should be preserved at target-dir when -p is passed.