[ https://issues.apache.org/jira/browse/HADOOP-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742212#comment-16742212 ]
Andrew Olson commented on HADOOP-16047:
---------------------------------------
Instead of a new config property, would something like this make sense?
{noformat}
final boolean toAppend = action == FileAction.APPEND;
// Null-safe comparison: getScheme() returns null for a path without a scheme.
final boolean directWriteToS3 =
    "s3a".equals(target.toUri().getScheme());
final boolean useTmpTarget = !toAppend && !directWriteToS3;
Path targetPath = useTmpTarget ? getTmpFile(target, context) : target;
{noformat}
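For anyone who wants to try the decision logic outside of DistCp, here is a minimal standalone sketch. It assumes nothing from Hadoop: java.net.URI stands in for org.apache.hadoop.fs.Path, the FileAction enum is a local stand-in, and the class and method names (DirectWriteSketch, isDirectWriteScheme, useTmpTarget) are hypothetical, not part of RetriableFileCopyCommand.

```java
import java.net.URI;

public class DirectWriteSketch {

    // Local stand-in for DistCp's file action; only APPEND matters here.
    enum FileAction { SKIP, APPEND, OVERWRITE }

    // True when the target scheme is "s3a". The constant-first comparison
    // is null-safe, since URI.getScheme() returns null for relative URIs.
    static boolean isDirectWriteScheme(URI target) {
        return "s3a".equals(target.getScheme());
    }

    // A temp file followed by a rename is only used when we are neither
    // appending nor writing to a target where rename is a copy + delete.
    static boolean useTmpTarget(FileAction action, URI target) {
        final boolean toAppend = action == FileAction.APPEND;
        return !toAppend && !isDirectWriteScheme(target);
    }

    public static void main(String[] args) {
        // S3A target: write directly, no temp file.
        System.out.println(useTmpTarget(FileAction.OVERWRITE,
            URI.create("s3a://bucket/data/part-0")));   // false
        // HDFS target: keep the temp file + rename behaviour.
        System.out.println(useTmpTarget(FileAction.OVERWRITE,
            URI.create("hdfs://nn:8020/data/part-0"))); // true
        // Append never uses a temp file, whatever the scheme.
        System.out.println(useTmpTarget(FileAction.APPEND,
            URI.create("hdfs://nn:8020/data/part-0"))); // false
        // No scheme at all: falls back to the temp file path.
        System.out.println(useTmpTarget(FileAction.OVERWRITE,
            URI.create("data/part-0")));                // true
    }
}
```

Hard-coding the scheme keeps the change small, but it would not cover other object stores where rename is also a copy, which may be an argument for the config property after all.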
> Avoid expensive rename when DistCp is writing to S3
> ---------------------------------------------------
>
> Key: HADOOP-16047
> URL: https://issues.apache.org/jira/browse/HADOOP-16047
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3, tools/distcp
> Reporter: Andrew Olson
> Priority: Major
>
> When writing to an S3-based target, the temp file and rename logic in
> RetriableFileCopyCommand adds some unnecessary cost to the job, as the rename
> operation does a server-side copy + delete in S3 [1]. The renames are
> parallelized across all of the DistCp map tasks, so the severity is mitigated
> to some extent. However, a configuration property to conditionally allow
> distributed copies to avoid that expense and write directly to the target
> path would improve performance considerably.
> [1]
> https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md#object-stores-vs-filesystems