[jira] [Commented] (HADOOP-11183) Memory-based S3AOutputstream

Thomas Demoor (JIRA) Wed, 18 Feb 2015 03:39:50 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325758#comment-14325758
 ]


Thomas Demoor commented on HADOOP-11183:
----------------------------------------

I have some open questions: 

*Exceptions*
Server-side errors are thrown as AmazonServiceException (which extends 
AmazonClientException). Its getMessage(), and therefore its toString(), 
returns quite a lot of info:
{code}
public String getMessage() {
    return "Status Code: " + getStatusCode() + ", "
        + "AWS Service: " + getServiceName() + ", "
        + "AWS Request ID: " + getRequestId() + ", "
        + "AWS Error Code: " + getErrorCode() + ", "
        + "AWS Error Message: " + super.getMessage();
}
{code}
The error codes are detailed here: 
http://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html 

A more general AmazonClientException is thrown if the server could not be 
reached (or there is another client-side problem). Do you want me to wrap 
every error code listed at the link above in a standard Java exception? I 
agree that typed exceptions are good as they provide more info, but given how 
detailed the response codes already are, this might not be the top priority. 
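
To make the wrapping question concrete, here is a minimal sketch of what 
translating a few error codes into standard Java exceptions could look like. 
The class S3ErrorTranslationSketch and its translate() method are hypothetical 
names, not part of s3a or the AWS SDK, and the choice of which code maps to 
which exception is only an assumption for discussion. It takes the status code, 
error code and message as plain values (the ones AmazonServiceException exposes 
through its getters) so that it compiles without the SDK.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;

// Hypothetical helper for discussion only; not part of s3a or the AWS SDK.
class S3ErrorTranslationSketch {

    /**
     * Maps an S3 error code to a standard Java exception. Codes without a
     * dedicated mapping fall back to a plain IOException that keeps the
     * status code and error code in its message.
     */
    static IOException translate(String path, int statusCode,
                                 String errorCode, String message) {
        String detail = path + ": " + errorCode
            + " (status " + statusCode + "): " + message;
        if (errorCode == null) {
            // switch on a null String would throw, so handle it up front.
            return new IOException(detail);
        }
        switch (errorCode) {
            case "NoSuchKey":
            case "NoSuchBucket":
                return new FileNotFoundException(detail);
            case "AccessDenied":
                return new AccessDeniedException(path, null,
                    errorCode + ": " + message);
            default:
                return new IOException(detail);
        }
    }

    public static void main(String[] args) {
        IOException e = translate("s3a://bucket/key", 404, "NoSuchKey",
            "The specified key does not exist.");
        System.out.println(e.getClass().getSimpleName() + ": " + e.getMessage());
    }
}
```

Only a handful of codes get a dedicated type here; everything else keeps the 
detailed message, which is roughly the trade-off raised above.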

*Statistics*

Unlike HDFS, s3a passes null to the wrapping FSDataOutputStream and does the 
statistics counting itself. It counts the bytes transferred to the server 
(double counting retries, etc.) by adding listeners (from the AWS library) to 
the uploads. It also calls statistics.incrementWriteOps(1) on every part of a 
multipart upload. It thus gives an S3-centric view of the filesystem stats, not 
a Hadoop one.

The newly introduced S3AFastOutputStream leaves byte counting to 
FSDataOutputStream (cf. HDFS) and only counts one writeOp per successful 
create(). Its behavior therefore differs from S3AOutputStream. Should I revert 
to the S3AOutputStream way, or will we change that to be HDFS-like (in a 
separate JIRA)? 

*Failures*

Currently, failure of a MultiPartUpload is only checked upon closing the file. 
So, for instance, if the server is unreachable, each part waits for the 
connection setup timeout before failing, which takes a while. Once one part has 
failed, we should abort as soon as possible. I think we can do this by adding a 
callback to each partUpload (a ListenableFuture) that sets an AtomicBoolean 
failed = true if the part has failed, and checking this flag before starting a 
new partUpload. That lets us throw the exception at the start of the next 
partUpload.
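
A minimal sketch of that fail-fast idea follows. To keep it self-contained it 
uses java.util.concurrent.CompletableFuture in place of Guava's 
ListenableFuture, so the callback wiring differs from what the patch would use. 
The class FailFastUploadSketch and its methods are hypothetical names, and the 
"part upload" is just a Runnable standing in for the real upload call.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; CompletableFuture stands in for Guava's ListenableFuture.
class FailFastUploadSketch {
    private final AtomicBoolean failed = new AtomicBoolean(false);
    private final AtomicReference<Throwable> firstFailure = new AtomicReference<>();
    // Only touched by the writing thread, so a plain list is enough here.
    private final List<CompletableFuture<Void>> parts = new ArrayList<>();
    private final ExecutorService pool;

    FailFastUploadSketch(ExecutorService pool) {
        this.pool = pool;
    }

    /** Starts one part upload unless an earlier part has already failed. */
    void uploadPart(Runnable partUpload) throws IOException {
        if (failed.get()) {
            throw new IOException("Multipart upload aborted: an earlier part failed",
                firstFailure.get());
        }
        // The callback records the failure as soon as the part completes.
        CompletableFuture<Void> tracked = CompletableFuture
            .runAsync(partUpload, pool)
            .whenComplete((ignored, error) -> {
                if (error != null) {
                    firstFailure.compareAndSet(null, error);
                    failed.set(true);
                }
            });
        parts.add(tracked);
    }

    /** Blocks until every started part and its callback have finished. */
    void awaitOutstandingParts() {
        for (CompletableFuture<Void> part : parts) {
            try {
                part.join();
            } catch (CompletionException e) {
                // Already recorded by the callback above.
            }
        }
    }

    /** The close()-time check still applies for failures in the last parts. */
    void close() throws IOException {
        awaitOutstandingParts();
        if (failed.get()) {
            throw new IOException("Multipart upload failed", firstFailure.get());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        FailFastUploadSketch upload = new FailFastUploadSketch(pool);
        try {
            upload.uploadPart(() -> {
                throw new IllegalStateException("simulated connection timeout");
            });
            upload.awaitOutstandingParts(); // only to make the demo deterministic
            upload.uploadPart(() -> System.out.println("not reached"));
        } catch (IOException e) {
            System.out.println("Aborted early: " + e.getMessage());
        } finally {
            pool.shutdown();
        }
    }
}
```

The flag only helps for parts started after a failure has been observed; parts 
already in flight still run into their own timeouts unless they are cancelled 
separately.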

> Memory-based S3AOutputstream
> ----------------------------
>
>                 Key: HADOOP-11183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11183
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.6.0
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>         Attachments: HADOOP-11183-004.patch, HADOOP-11183.001.patch, 
> HADOOP-11183.002.patch, HADOOP-11183.003.patch, design-comments.pdf
>
>
> Currently s3a buffers files on disk(s) before uploading. This JIRA 
> investigates adding a memory-based upload implementation.
> The motivation is evidently performance: this would be beneficial for users 
> with high network bandwidth to S3 (EC2?) or users that run Hadoop directly on 
> an S3-compatible object store (FYI: my contributions are made in name of 
> Amplidata). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
