[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem

Gabor Bota (JIRA) Wed, 20 Mar 2019 06:45:24 -0700


    [ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797184#comment-16797184 ]


Gabor Bota commented on HADOOP-16132:
-------------------------------------

I tried to apply your latest patch (v005) to trunk, but it does not apply:
{noformat}
error: patch failed: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java:641
error: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java: patch does not apply
error: patch failed: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java:18
error: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java: patch does not apply
{noformat}

Could you rebase it, please?

> Support multipart download in S3AFileSystem
> -------------------------------------------
>
>                 Key: HADOOP-16132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16132
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Justin Uang
>            Priority: Major
>         Attachments: HADOOP-16132.001.patch, HADOOP-16132.002.patch, 
> HADOOP-16132.003.patch, HADOOP-16132.004.patch, HADOOP-16132.005.patch, 
> seek-logs-parquet.txt
>
>
> I noticed that I get 150MB/s when I use the AWS CLI
> {code:java}
> aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
> vs 50MB/s when I use the S3AFileSystem
> {code:java}
> hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
> Looking into the AWS CLI code, the 
> [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
>  logic is quite clever. It downloads the next few parts in parallel using 
> range requests, then buffers them in memory so it can reorder them and 
> expose a single contiguous stream. I translated the logic to Java and 
> modified the S3AFileSystem to do something similar, and I am now able to 
> achieve 150MB/s download speeds as well. It is mostly done, but I have some 
> things to clean up first. The PR is here: 
> https://github.com/palantir/hadoop/pull/47/files
> It would be great to get some other eyes on it to see what we need to do to 
> get it merged.
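
For readers following the description above, the approach (several range requests in flight, parts buffered in memory and handed out in order) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patches, the linked PR, or s3transfer. The names ParallelRangeDownload, RangeFetcher, copy, partSize and maxInFlight are hypothetical, and RangeFetcher stands in for whatever issues the actual ranged GET.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of parallel range reads with in-order reassembly. */
final class ParallelRangeDownload {

  /** Assumed stand-in for a ranged GET: bytes from start (inclusive) to end (exclusive). */
  interface RangeFetcher {
    byte[] fetch(long start, long end) throws IOException;
  }

  /**
   * Copies length bytes to out, keeping at most maxInFlight range requests
   * outstanding. Buffered data is bounded by roughly maxInFlight * partSize bytes.
   */
  static void copy(RangeFetcher fetcher, long length, int partSize,
      int maxInFlight, OutputStream out)
      throws IOException, InterruptedException {
    if (partSize <= 0 || maxInFlight <= 0) {
      throw new IllegalArgumentException("partSize and maxInFlight must be positive");
    }
    ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
    // Futures are queued in request order, so draining the queue head
    // restores the original byte order even if later parts finish first.
    Queue<Future<byte[]>> inFlight = new ArrayDeque<>();
    long next = 0;
    try {
      while (next < length || !inFlight.isEmpty()) {
        // Top up the window of outstanding range requests.
        while (next < length && inFlight.size() < maxInFlight) {
          final long start = next;
          final long end = Math.min(start + partSize, length);
          inFlight.add(pool.submit(() -> fetcher.fetch(start, end)));
          next = end;
        }
        // Block on the oldest part only; newer parts keep downloading meanwhile.
        try {
          out.write(inFlight.remove().get());
        } catch (ExecutionException e) {
          throw new IOException("range request failed", e.getCause());
        }
      }
    } finally {
      // Cancel any parts still running if we exit early.
      pool.shutdownNow();
    }
  }
}
```

In this sketch the consumer only ever waits on the oldest outstanding part, which is what lets the later parts overlap with it. A real input stream would also need to handle seeks, retries and cancellation, none of which are shown here.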



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

