[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727506#comment-15727506 ]

Sean Mackrory commented on HADOOP-13868:
----------------------------------------

Just to add to the numbers above, 256 MB / 512 MB seemed to be the right spot 
when I was computing in us-west-2 and storing in us-west-1. If I do both in 
us-west-1, about 100 MB seems to be where multipart becomes faster for both 
uploads and renames. It would be good to get more data on how that changes - 
I'll post the script I'm using in case other folks want to try it for their 
own setups.
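
In the meantime, here's roughly the shape of the measurement in Java terms, in 
case that's useful before I attach the actual script (bucket name, object size, 
and the threshold/part-size values below are just placeholders):

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Rough sketch of the timing, not the actual script: sweep the multipart
 * settings, write an object, then rename it (rename == server-side copy plus
 * delete on S3A) and time both steps.
 */
public class S3AMultipartTiming {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("fs.s3a.multipart.threshold", 256L * 1024 * 1024); // value under test
    conf.setLong("fs.s3a.multipart.size", 100L * 1024 * 1024);      // value under test

    FileSystem fs = FileSystem.get(URI.create("s3a://my-test-bucket/"), conf);
    Path src = new Path("s3a://my-test-bucket/timing/src.dat");
    Path dst = new Path("s3a://my-test-bucket/timing/dst.dat");
    long objectSize = 1024L * 1024 * 1024;       // 1 GB test object
    byte[] buffer = new byte[8 * 1024 * 1024];

    long t0 = System.nanoTime();
    try (FSDataOutputStream out = fs.create(src, true)) {
      for (long written = 0; written < objectSize; written += buffer.length) {
        out.write(buffer);
      }
    }
    long t1 = System.nanoTime();
    fs.rename(src, dst);
    long t2 = System.nanoTime();

    System.out.printf("upload: %.1f s, rename: %.1f s%n",
        (t1 - t0) / 1e9, (t2 - t1) / 1e9);
    fs.delete(dst, false);
  }
}
{code}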

Given that I'm not seeing a huge discrepancy between upload and rename speeds 
(I was originally weighing the 512 MB that looked like the right transition 
point for multipart renames against the 16 MB that is the default in 
core-default.xml), maybe the right thing to do here is indeed to keep the 
configuration together, but consider bumping the value in core-default.xml up. 
(I definitely haven't seen S3 performance fluctuate over time by anywhere near 
that much, so I'm reluctant to write the gap off as that.) </thinking out loud>
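
If we do end up splitting things per the description below, I'd picture the 
existing property staying as a shared shorthand that the more specific keys 
override. Sketch only - the new key names are just the ones proposed there, and 
the fallback value is a placeholder:

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Sketch of option (1) below: shared shorthand, overridden by specific keys. */
final class MultipartSettings {
  final long uploadThreshold;
  final long copyThreshold;

  MultipartSettings(Configuration conf) {
    // Placeholder default; the right number is the open question in this comment.
    long fallback = 128L * 1024 * 1024;
    // The existing property keeps working as a shared shorthand...
    long shared = conf.getLong("fs.s3a.multipart.threshold", fallback);
    // ...and the proposed upload/copy-specific keys override it when set.
    uploadThreshold = conf.getLong("fs.s3a.multipart.upload.threshold", shared);
    copyThreshold = conf.getLong("fs.s3a.multipart.copy.threshold", shared);
  }
}
{code}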

> S3A should configure multi-part copies and uploads separately
> -------------------------------------------------------------
>
>                 Key: HADOOP-13868
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13868
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.0, 3.0.0-alpha1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>
> I've been looking at a big performance regression when writing to S3 from 
> Spark that appears to have been introduced with HADOOP-12891.
> In the Amazon SDK, the default threshold for multi-part copies is 320x the 
> threshold for multi-part uploads (and the block size is 20x bigger), so I 
> don't think it's necessarily wise for us to have them be the same.
> I did some quick tests and it seems to me the sweet spot when multi-part 
> copies start being faster is around 512MB. It wasn't as significant, but 
> using 104857600 (Amazon's default) for the blocksize was also slightly better.
> I propose we do the following, although they're independent decisions:
> (1) Split the configuration. Ideally, I'd like to have 
> fs.s3a.multipart.copy.threshold and fs.s3a.multipart.upload.threshold (and 
> corresponding properties for the block size). But then there's the question 
> of what to do with the existing fs.s3a.multipart.* properties. Deprecation? 
> Leave it as a short-hand for configuring both (that's overridden by the more 
> specific properties?).
> (2) Consider increasing the default values. In my tests, 256 MB seemed to be 
> where multipart uploads came into their own, and 512 MB was where multipart 
> copies started outperforming the alternative. Would be interested to hear 
> what other people have seen.
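
For reference on the SDK side of the description above: the 1.x TransferManager 
already keeps the upload and copy thresholds/part sizes as four separate knobs, 
which is what splitting the S3A properties would mirror. Sketch from memory of 
that API; the values are what I recall the SDK defaults being (they match the 
320x / 20x ratios quoted above), so double-check against the SDK version we ship:

{code:java}
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;

public class TransferManagerKnobs {
  public static void main(String[] args) {
    TransferManagerConfiguration tmConf = new TransferManagerConfiguration();
    tmConf.setMultipartUploadThreshold(16L * 1024 * 1024);      // uploads go multipart at 16 MB
    tmConf.setMinimumUploadPartSize(5L * 1024 * 1024);          // 5 MB upload parts
    tmConf.setMultipartCopyThreshold(5L * 1024 * 1024 * 1024);  // copies go multipart at 5 GB (320x)
    tmConf.setMultipartCopyPartSize(100L * 1024 * 1024);        // 100 MB copy parts (20x)

    TransferManager tm = new TransferManager();                 // default credential chain
    tm.setConfiguration(tmConf);
  }
}
{code}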



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
