[ 
https://issues.apache.org/jira/browse/HADOOP-12891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-12891:
------------------------------------
    Attachment: HADOOP-12891-002.patch

Patch 002: Adds the documentation.

Excluding the docs, this patch is Andrew's work, just converted into a .patch
file, so I still consider myself in a position to review it. However, I'd
still like others with S3 access to test this, just to make sure there are no
surprises.

My tests were against Amazon S3 Ireland (eu-west-1), BTW.
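For anyone else who wants to test: the hadoop-aws tests pick up credentials
and the target endpoint from src/test/resources/auth-keys.xml, then run via
{{mvn test}} under hadoop-tools/hadoop-aws. A minimal sketch, assuming an
Ireland (eu-west-1) bucket; the bucket name and key values are placeholders:
{noformat}
<configuration>
  <property>
    <name>test.fs.s3a.name</name>
    <value>s3a://your-test-bucket/</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3-eu-west-1.amazonaws.com</value>
  </property>
</configuration>
{noformat}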

> S3AFileSystem should configure Multipart Copy threshold and chunk size
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-12891
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12891
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.2
>            Reporter: Andrew Olson
>            Assignee: Andrew Olson
>         Attachments: HADOOP-12891-001.patch, HADOOP-12891-002.patch
>
>
> In the AWS S3 Java SDK, the defaults for the multipart copy threshold and
> chunk size are very high [1]:
> {noformat}
>     /** Default size threshold for Amazon S3 object after which multi-part 
> copy is initiated. */
>     private static final long DEFAULT_MULTIPART_COPY_THRESHOLD = 5 * GB;
>     /** Default minimum size of each part for multi-part copy. */
>     private static final long DEFAULT_MINIMUM_COPY_PART_SIZE = 100 * MB;
> {noformat}
> In internal testing we have found that a lower, but still reasonable,
> threshold and chunk size can be extremely beneficial. In our case we set
> both the threshold and part size to 25 MB, with good results. Amazon
> enforces a minimum part size of 5 MB [2].
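> For illustration, that tuning expressed as a core-site.xml sketch (values
> are in bytes; note that today these properties only govern uploads, which
> is exactly the gap this issue is about):
> {noformat}
> <property>
>   <name>fs.s3a.multipart.threshold</name>
>   <value>26214400</value> <!-- 25 MB -->
> </property>
> <property>
>   <name>fs.s3a.multipart.size</name>
>   <value>26214400</value> <!-- 25 MB -->
> </property>
> {noformat}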
> For the S3A filesystem, file renames are actually implemented via a remote
> copy request, which is already quite slow compared to a rename on HDFS. This
> very high threshold for triggering the multipart functionality can make the
> performance considerably worse, particularly for files in the 100 MB to 5 GB
> range, which is fairly typical for MapReduce job outputs.
> Two apparent options are:
> 1) Use the same configuration ({{fs.s3a.multipart.threshold}},
> {{fs.s3a.multipart.size}}) for both. This seems preferable, as the
> accompanying documentation [3] for these configuration properties already
> says that they are applicable to either "uploads or copies". We just need to
> add the missing {{TransferManagerConfiguration#setMultipartCopyThreshold}}
> [4] and {{TransferManagerConfiguration#setMultipartCopyPartSize}} [5] calls
> at [6], like this (a fuller sketch follows the options below):
> {noformat}
>     /* Handle copies in the same way as uploads. */
>     transferConfiguration.setMultipartCopyPartSize(partSize);
>     transferConfiguration.setMultipartCopyThreshold(multiPartThreshold);
> {noformat}
> 2) Add two new configuration properties so that the copy threshold and part
> size can be configured independently, perhaps with defaults lower than
> Amazon's, and set them into {{TransferManagerConfiguration}} in the same
> way.
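> To make option 1 concrete, a minimal sketch of the relevant block in
> {{S3AFileSystem#initialize}}; the surrounding code is paraphrased from [6]
> and may differ by version, and only the last two calls are new:
> {noformat}
>     // Existing setup (paraphrased): partSize and multiPartThreshold are
>     // the values read from fs.s3a.multipart.size and
>     // fs.s3a.multipart.threshold; today they only affect uploads.
>     TransferManagerConfiguration transferConfiguration =
>         new TransferManagerConfiguration();
>     transferConfiguration.setMinimumUploadPartSize(partSize);
>     transferConfiguration.setMultipartUploadThreshold(multiPartThreshold);
>
>     // Proposed addition: handle copies (i.e. renames) the same way.
>     transferConfiguration.setMultipartCopyPartSize(partSize);
>     transferConfiguration.setMultipartCopyThreshold(multiPartThreshold);
> {noformat}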
> In any case, at a minimum, if neither of the above options is an acceptable
> change, the config documentation should be adjusted to match the code,
> noting that {{fs.s3a.multipart.threshold}} and {{fs.s3a.multipart.size}}
> apply only to uploads of new objects and not to copies (i.e. renaming
> objects).
> [1] 
> https://github.com/aws/aws-sdk-java/blob/1.10.58/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.java#L36-L40
> [2] http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
> [3] 
> https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A
> [4] 
> http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyThreshold(long)
> [5] 
> http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyPartSize(long)
> [6] 
> https://github.com/apache/hadoop/blob/release-2.7.2-RC2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L286



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
