[ 
https://issues.apache.org/jira/browse/HADOOP-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225238#comment-15225238
 ] 

Joep Rottinghuis commented on HADOOP-13002:
-------------------------------------------

The problem is that a new derived property is added to DistCpOptionSwitch 

Then in Distcp we have
{code}
  public int run(String[] argv) {
   ...
    try {
      inputOptions = (OptionsParser.parse(argv));
      setTargetPathExists();
      LOG.info("Input Options: " + inputOptions);
{code}
This doesn't get executed when calling distcp from a java program, because in 
that case a deparate DistCpOptionSwitch gets created and DistCp is executed 
like so:
{code}
  Path targetPath = ...
  DistCpOptions options = new DistCpOptions(sourcePaths, targetPath);
  
  // Set desired options
  options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);
  options.preserve(DistCpOptions.FileAttribute.CHECKSUMTYPE);
  options.preserve(DistCpOptions.FileAttribute.REPLICATION);
  options.preserve(DistCpOptions.FileAttribute.TIMES);
  options.preserve(DistCpOptions.FileAttribute.USER);
  options.preserve(DistCpOptions.FileAttribute.GROUP);
  options.preserve(DistCpOptions.FileAttribute.PERMISSION);
  options.setBlocking(true);
  // -atomic defaults to false
  options.setAtomicCommit(false);
  // -update defaults to false
  options.setSyncFolder(false);
  
  Configuration conf = ...
  Job job = null;
  DistCp distCp = new DistCp(conf, options);
  job = distCp.execute();
{code}

in DistcpOptions targetPathExists defaults to true, the comment even suggests 
this:
{code}
  // targetPathExist is a derived field, it's initialized in the
  // beginning of distcp.
  private boolean targetPathExists = true;
{code}
The key problem is that it is initialized only from the run method, not in the 
execute method.

With the path not existing, but DistCp assuming it does exist when executed 
from code, the following special handling fails to compute correctly in 
SimpleCopyListing.computeSourceRootPath
{code}
      boolean specialHandling =
          (options.getSourcePaths().size() == 1 && !targetPathExists)
              || options.shouldSyncFolder() || options.shouldOverwrite();
{code}
As a result the parent directory is prepended to the files in the generated 
sequence file which results in the extra xyz/c directory in the target.

The least amount of code change would be to move the initialization in DistCP 
from run to execute, however that has three disadvantages:
1) We'd also have to move the logging statement, otherwise an incorrectly 
initialized variable is logged. That means that the invocation from Java 
suddenly does additional logging.
2) Having a derived method that could be explicitly set (to incorrect value) 
and then later overwritten during execution is weird.
3) With hindsight we should have never introduced a setter for the derived 
property. It allows the option and the associated property in the config to get 
out of sync. Moving the external setting doesn't fix this.

I think the proper solution would be to remove the setTargetPathExists method 
from DistCp and to move it to DistCpOptions and to invoke it from both 
constructors (but not the copy constructor).
Now that the setTargetPathExists method was added in 2.5 we cannot remove it 
w/o breaking backwards compatibility, but I think we should mark it as 
deprecated. We could also consider making it a no-op given that the constructor 
already calculates it. Possibly we can make it ignore the argument and 
re-calculate the exists value. I don't see the value in letting people set an 
incorrect value (of path existing when it doesn't and visa versa).

However, the setting of the conf property has to remain in the distcp.execute 
method, because the DistCpOptions class isn't a Configurable and will therefore 
not have access to getConf(). Passing the conf into the constructor of 
DistCpOptions would change its signature which we cannot do.

> distcp behaves differently through code compared to toolrunner invocation 
> from command-line
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13002
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13002
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 2.5.0, 2.6.0, 2.7.0, 3.0.0
>            Reporter: Joep Rottinghuis
>
> In Hadoop 2.5 the behavior of distcp changed when called through code iff the 
> target directory did not exist and update wasn't used and atomic wasn't used.
> HADOOP-10459 introduced a change to preserve the root directory attributes. 
> It introduced a derivative property in the options as well as in the 
> configuration whether the target path exists. See 
> https://github.com/apache/hadoop/commit/c5b59477775c797944db4992e8a70289ba2895ed
> However, this property is set only when distcp is used through the command 
> line as a ToolRunner in Distcp.run(String[] argv).
> The result is that when the target directory doesn't exist (and neither 
> -update nor -atomic options are used) SimplyCopyListing incorrectly assumes 
> that the target directory does exist because the attribute defaults to true. 
> Copying directory a/b/c to xyz results in the creation of a xyx/c directory 
> with the content of c in it, rather than the content of c getting copied into 
> directory xyz directly.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to