[
https://issues.apache.org/jira/browse/HADOOP-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225238#comment-15225238
]
Joep Rottinghuis commented on HADOOP-13002:
-------------------------------------------
The problem is that a new derived property is added to DistCpOptionSwitch
Then in Distcp we have
{code}
public int run(String[] argv) {
...
try {
inputOptions = (OptionsParser.parse(argv));
setTargetPathExists();
LOG.info("Input Options: " + inputOptions);
{code}
This doesn't get executed when calling distcp from a java program, because in
that case a deparate DistCpOptionSwitch gets created and DistCp is executed
like so:
{code}
Path targetPath = ...
DistCpOptions options = new DistCpOptions(sourcePaths, targetPath);
// Set desired options
options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);
options.preserve(DistCpOptions.FileAttribute.CHECKSUMTYPE);
options.preserve(DistCpOptions.FileAttribute.REPLICATION);
options.preserve(DistCpOptions.FileAttribute.TIMES);
options.preserve(DistCpOptions.FileAttribute.USER);
options.preserve(DistCpOptions.FileAttribute.GROUP);
options.preserve(DistCpOptions.FileAttribute.PERMISSION);
options.setBlocking(true);
// -atomic defaults to false
options.setAtomicCommit(false);
// -update defaults to false
options.setSyncFolder(false);
Configuration conf = ...
Job job = null;
DistCp distCp = new DistCp(conf, options);
job = distCp.execute();
{code}
in DistcpOptions targetPathExists defaults to true, the comment even suggests
this:
{code}
// targetPathExist is a derived field, it's initialized in the
// beginning of distcp.
private boolean targetPathExists = true;
{code}
The key problem is that it is initialized only from the run method, not in the
execute method.
With the path not existing, but DistCp assuming it does exist when executed
from code, the following special handling fails to compute correctly in
SimpleCopyListing.computeSourceRootPath
{code}
boolean specialHandling =
(options.getSourcePaths().size() == 1 && !targetPathExists)
|| options.shouldSyncFolder() || options.shouldOverwrite();
{code}
As a result the parent directory is prepended to the files in the generated
sequence file which results in the extra xyz/c directory in the target.
The least amount of code change would be to move the initialization in DistCP
from run to execute, however that has three disadvantages:
1) We'd also have to move the logging statement, otherwise an incorrectly
initialized variable is logged. That means that the invocation from Java
suddenly does additional logging.
2) Having a derived method that could be explicitly set (to incorrect value)
and then later overwritten during execution is weird.
3) With hindsight we should have never introduced a setter for the derived
property. It allows the option and the associated property in the config to get
out of sync. Moving the external setting doesn't fix this.
I think the proper solution would be to remove the setTargetPathExists method
from DistCp and to move it to DistCpOptions and to invoke it from both
constructors (but not the copy constructor).
Now that the setTargetPathExists method was added in 2.5 we cannot remove it
w/o breaking backwards compatibility, but I think we should mark it as
deprecated. We could also consider making it a no-op given that the constructor
already calculates it. Possibly we can make it ignore the argument and
re-calculate the exists value. I don't see the value in letting people set an
incorrect value (of path existing when it doesn't and visa versa).
However, the setting of the conf property has to remain in the distcp.execute
method, because the DistCpOptions class isn't a Configurable and will therefore
not have access to getConf(). Passing the conf into the constructor of
DistCpOptions would change its signature which we cannot do.
> distcp behaves differently through code compared to toolrunner invocation
> from command-line
> -------------------------------------------------------------------------------------------
>
> Key: HADOOP-13002
> URL: https://issues.apache.org/jira/browse/HADOOP-13002
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools/distcp
> Affects Versions: 2.5.0, 2.6.0, 2.7.0, 3.0.0
> Reporter: Joep Rottinghuis
>
> In Hadoop 2.5 the behavior of distcp changed when called through code iff the
> target directory did not exist and update wasn't used and atomic wasn't used.
> HADOOP-10459 introduced a change to preserve the root directory attributes.
> It introduced a derivative property in the options as well as in the
> configuration whether the target path exists. See
> https://github.com/apache/hadoop/commit/c5b59477775c797944db4992e8a70289ba2895ed
> However, this property is set only when distcp is used through the command
> line as a ToolRunner in Distcp.run(String[] argv).
> The result is that when the target directory doesn't exist (and neither
> -update nor -atomic options are used) SimplyCopyListing incorrectly assumes
> that the target directory does exist because the attribute defaults to true.
> Copying directory a/b/c to xyz results in the creation of a xyx/c directory
> with the content of c in it, rather than the content of c getting copied into
> directory xyz directly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)