[ 
https://issues.apache.org/jira/browse/HADOOP-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760999#comment-16760999
 ] 

Steve Loughran commented on HADOOP-15281:
-----------------------------------------

right, really good piece fo work, nearly ready to go in. 

I like the tests in particular..nicely done, including the s3a extension to 
verify that the #of renames is zero

Some minor change in RetriableFileCopyCommand.doCopy() covered below.

Before that, How about the option is called "direct"? keeps it easier to 
remember. Code can stay as is, but the CLI option & docs can be changed. 

I just like a {{-direct -update}} as a set of commands (I never get 
-skipCrcCheck right; that's precisely what I don't want to replicate -people 
mistakenly going -directWrite. A short "direct" avoids case confusion

h3. RetriableFileCopyCommand.doCopy()

L122 rename useTmpTarget to useTempTarget

L125 (and all other new log statements) We're using the SLF4J API, so can use 
LOG.info("Writing to {} target file path {}, 
   useTempTarget ? "temporary" ? "direct", targetPath)

This also means you can skip the LOG.isDebugEnabled() as the string eval & 
concat is only done if there's a log event. (Don't bother with the existing 
logs, just worry about those which you are adding/changing)

L172: no need to call targetFS.exists(targetPath) before the delete; delete 
does the checks and is a no-op if the destination doesn't exist. Saves many 
HTTP requests against a store.

One thing I wondered about is "could we actually log the duration of 
operations, e.g. the rename()". But the lib to do that (DurationInfo) is in 
hadoop-aws; I think moving that would be good, but it's a separate bit of work 
(HADOOP-16093). So don't worry about it. That said, CopyCommitter.deleteMissing 
does have some variant of that code: distcp would be the obvious place to add a 
chunk of this.

I don't want to have that block this patch, or complicate backporting (I will 
cherrypick this to branch-2, but not -2.8).

Accordingly:

# fix those bits of RetriableFileCopyCommand
# tell me what you think of having the option "direct", and if you are happy 
with it, do that change.

thanks


> Distcp to add no-rename copy option
> -----------------------------------
>
>                 Key: HADOOP-15281
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15281
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>            Assignee: Andrew Olson
>            Priority: Major
>         Attachments: HADOOP-15281-001.patch, HADOOP-15281-002.patch, 
> HADOOP-15281-003.patch
>
>
> Currently Distcp uploads a file by two strategies
> # append parts
> # copy to temp then rename
> option 2 executes the following sequence in {{promoteTmpToTarget}}
> {code}
>     if ((fs.exists(target) && !fs.delete(target, false))
>         || (!fs.exists(target.getParent()) && !fs.mkdirs(target.getParent()))
>         || !fs.rename(tmpTarget, target)) {
>       throw new IOException("Failed to promote tmp-file:" + tmpTarget
>                               + " to: " + target);
>     }
> {code}
> For any object store, that's a lot of HTTP requests; for S3A you are looking 
> at 12+ requests and an O(data) copy call. 
> This is not a good upload strategy for any store which manifests its output 
> atomically at the end of the write().
> Proposed: add a switch to write direct to the dest path. either a conf option 
> or a CLI option



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to