[
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-13560:
------------------------------------
Status: Patch Available (was: Open)
Commit fc16e03c; Patch 005. Moved all the operations in the block output stream
which directly interacted with the s3 client into a new inner class of
S3AFileSystem, WriteOperationState. This cleanly separates the output stream's
work (buffering of data and queuing of uploads) from the upload process itself.
I think S3Guard may be able to do something with this, but I also hope to use
it as a start for async directory list/delete operations; this class would
track create-time probes, and initiate the async deletion of directory objects
after a successful write. That's why there are separate callbacks for
writeSuccessful and writeFailed: we only want to spawn off the deletion when
the write succeeds.
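To show the intended split, here's a minimal sketch of the shape such a helper
could take. Only the class name WriteOperationState and the two callback names
come from the patch; every signature below is illustrative, not the actual code:
{code:java}
// Illustrative sketch only: WriteOperationState and the two callbacks come
// from the patch description; everything else here is an assumption.
public class S3AFileSystemSketch {

  /** Owns all direct interaction with the S3 client, so the output stream
   *  is left with only buffering of data and queuing of uploads. */
  class WriteOperationState {
    private final String key;

    WriteOperationState(String key) {
      this.key = key;
    }

    /** Invoked once a write completes successfully: the natural point to
     *  kick off async deletion of fake parent directory objects. */
    void writeSuccessful() {
      // hypothetical: deleteFakeParentDirectoriesAsync(key);
    }

    /** Invoked when a write fails; no directory cleanup should run. */
    void writeFailed(Exception cause) {
      // hypothetical: update the stream's failure statistics
    }
  }
}
{code}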
In the process of coding all this, I managed to break multipart uploads; this
has led to a clearer understanding of how part uploads fail, and to improvements
in statistics collection and in the tests.
Otherwise,
* got the imports back in sync with branch-2; the IDE had somehow rearranged
them.
* more detailed docs.
* manual testing through all the FS operations.
* locally switched all the s3a tests to use this (i.e. turned on block output
in auth-keys.xml; see the configuration sketch below).
I think this is ready for review and play. I'd recommend the disk block buffer
except in the special case where you know you can upload data faster than you
can generate it, and you want to bypass the disk. But I'd be curious about
performance numbers there, especially on distcp operations with s3a as the
destination.
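For reference, a sketch of the switches involved; the property names are my
reading of the branch-2 fast upload options, so verify them against the docs
shipped with the patch before copying:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of switching to block output with the disk buffer; property names
// are assumptions based on the branch-2 fast upload options.
public class BlockOutputConfigSketch {
  public static Configuration blockOutputConf() {
    Configuration conf = new Configuration();
    // enable the block output stream in place of the classic output stream
    conf.setBoolean("fs.s3a.fast.upload", true);
    // buffer blocks on local disk; "array" would buffer in the JVM heap
    conf.set("fs.s3a.fast.upload.buffer", "disk");
    return conf;
  }
}
{code}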
> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
> Key: HADOOP-13560
> URL: https://issues.apache.org/jira/browse/HADOOP-13560
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-13560-branch-2-001.patch,
> HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch,
> HADOOP-13560-branch-2-004.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really
> works.
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations by committers using rename.
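A sketch of what that test could look like, contract-test style; every helper
method below is a hypothetical stand-in, not an existing test API:
{code:java}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

/** Sketch only: the abstract helpers are hypothetical stand-ins. */
public abstract class AbstractHugeRenameSketch {

  /** Hypothetical: bind to an S3A filesystem from the test configuration. */
  protected abstract FileSystem getS3AFileSystem() throws Exception;

  /** Hypothetical: write a file big enough to force a multipart copy. */
  protected abstract long writeHugeFile(FileSystem fs, Path p) throws Exception;

  /** Hypothetical: read back the user metadata attached to an object. */
  protected abstract String getUserMetadata(FileSystem fs, Path p)
      throws Exception;

  @Test
  public void testHugeRenamePreservesMetadata() throws Exception {
    FileSystem fs = getS3AFileSystem();
    Path src = new Path("/tests/huge-src");
    Path dst = new Path("/tests/huge-dst");

    long len = writeHugeFile(fs, src);
    String srcMeta = getUserMetadata(fs, src);

    assertTrue("rename returned false", fs.rename(src, dst));
    assertEquals("length changed in copy", len,
        fs.getFileStatus(dst).getLen());
    // The linked AWS SDK issue: metadata may be dropped on large copies.
    assertEquals("metadata lost on copy", srcMeta, getUserMetadata(fs, dst));
  }
}
{code}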