[
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473510#comment-15473510
]
ASF GitHub Bot commented on HADOOP-13560:
-----------------------------------------
GitHub user steveloughran opened a pull request:
https://github.com/apache/hadoop/pull/125
HADOOP-13560 S3A to support huge file writes and operations -with tests
Adds
## Scale tests for S3A huge file support
- always running at the MB size (maybe best to make this optional)
- configurable to bigger sizes in the auth-keys XML or in the build:
`-Dfs.s3a.scale.test.huge.filesize=1000` (see the sketch after this list)
- limited to upload, seek, read, rename, delete. The JUnit test cases are
explicitly set up to run in order here.
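As an illustration of how a scale test might pick up that configured size, here is a minimal Java sketch. Only the property name `fs.s3a.scale.test.huge.filesize` comes from the build line above; the class, helper and default names are hypothetical, and treating the value as megabytes is an assumption made for the example, not something taken from the patch.

```java
// Sketch only: how a JUnit scale test might resolve the huge-file size from the
// maven build property named in the PR description. Names here are illustrative.
import org.apache.hadoop.conf.Configuration;

public class HugeFileSizeExample {
  // Property named in the PR: -Dfs.s3a.scale.test.huge.filesize=1000
  static final String KEY_HUGE_FILESIZE = "fs.s3a.scale.test.huge.filesize";
  // "always running at the MB size": small default when nothing is configured.
  // Interpreting the value as megabytes is an assumption of this sketch.
  static final int DEFAULT_HUGE_FILESIZE_MB = 1;

  /** Resolve the test file size in megabytes, falling back to the small default. */
  static int getHugeFilesizeMB(Configuration conf) {
    return conf.getInt(KEY_HUGE_FILESIZE, DEFAULT_HUGE_FILESIZE_MB);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The maven build can hand the -D value down as a system property.
    String fromBuild = System.getProperty(KEY_HUGE_FILESIZE);
    if (fromBuild != null) {
      conf.setInt(KEY_HUGE_FILESIZE, Integer.parseInt(fromBuild));
    }
    System.out.println("huge file size (MB): " + getHugeFilesizeMB(conf));
  }
}
```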
## New scalable output stream for writing, `S3ABlockOutputStream`
- always saves in incremental blocks as writes proceed, with block size ==
partition size (a minimal sketch of the idea follows this list)
- supports the Fast output stream's memory buffer code (for regression testing)
- supports a back end which buffers blocks in files, using round-robin disk
allocation. As such, write/read bandwidth is limited to aggregate HDD bandwidth.
- adds extra failure resilience as testing throws up failure conditions
(network timeouts, no response from the server on multipart commit, etc.)
- adds instrumentation, including callbacks from the AWS SDK to update
gauges and counters (in progress)
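To make the incremental-block idea concrete, here is a minimal, hypothetical sketch of a block-buffering output stream: bytes accumulate in a block sized to the multipart partition, and each full block is handed off for upload while the caller keeps writing. The class and the `uploadBlock()`/`completeUpload()` hooks are illustrative stand-ins, not the code in the patch.

```java
// Illustration of the incremental-block design, not the patch itself.
import java.io.IOException;
import java.io.OutputStream;

public abstract class BlockingUploadOutputStream extends OutputStream {
  private final int blockSize;   // == multipart partition size
  private byte[] block;
  private int used;

  protected BlockingUploadOutputStream(int blockSize) {
    this.blockSize = blockSize;
    this.block = new byte[blockSize];
  }

  @Override
  public void write(int b) throws IOException {
    block[used++] = (byte) b;
    if (used == blockSize) {
      uploadBlock(block, used);     // upload this partition, then start a new one
      block = new byte[blockSize];
      used = 0;
    }
  }

  @Override
  public void close() throws IOException {
    if (used > 0) {
      uploadBlock(block, used);     // final, possibly short, partition
    }
    completeUpload();               // e.g. complete the multipart upload
  }

  /** Upload one block; a real back end would buffer it in RAM or in a local file. */
  protected abstract void uploadBlock(byte[] data, int length) throws IOException;

  /** Commit the whole object once every block has been uploaded. */
  protected abstract void completeUpload() throws IOException;
}
```

In the disk-backed variant described above, `uploadBlock()` would stream from a local file chosen by round-robin allocation across the configured buffer directories, rather than from an in-memory array.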
What we have here is essentially something that can replace the classic
"save to file, upload at the end" stream and the fast "store it all in RAM and
hope there's space" stream. It should offer incremental upload for faster
output of larger files compared to the classic file stream, with the scalability
the fast one lacks, and the instrumentation to show what's happening (see the
progress-callback sketch below).
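On the instrumentation point, the AWS SDK's `com.amazonaws.event.ProgressListener` is the callback mechanism referred to above. The sketch below shows one hypothetical way such callbacks could feed a bytes-uploaded counter; the counter class and its wiring are illustrative, not the S3A instrumentation in the patch.

```java
// Hedged illustration of "callbacks from the AWS SDK": a ProgressListener that
// accumulates bytes transferred. Only the listener interface is the SDK's own;
// the counter and its use are assumptions for the example.
import java.util.concurrent.atomic.AtomicLong;

import com.amazonaws.event.ProgressEvent;
import com.amazonaws.event.ProgressListener;

public class CountingProgressListener implements ProgressListener {
  private final AtomicLong bytesUploaded = new AtomicLong();

  @Override
  public void progressChanged(ProgressEvent progressEvent) {
    long transferred = progressEvent.getBytesTransferred();
    if (transferred > 0) {
      // In S3A this is where a gauge/counter in the filesystem instrumentation
      // would be updated; here we just keep a running total.
      bytesUploaded.addAndGet(transferred);
    }
  }

  public long getBytesUploaded() {
    return bytesUploaded.get();
  }
}
```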
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/hadoop s3/HADOOP-13560-5GB-blobs
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/hadoop/pull/125.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #125
----
commit d21600ead50aaafed1611b00206991c7d2c5934f
Author: Steve Loughran <[email protected]>
Date: 2016-08-30T15:48:14Z
HADOOP-13560 adding test for create/copy 5GB files
commit 13b0544fffe7feb3e6d7404c90f222f1ae6644bb
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T11:29:28Z
HADOOP-13560 tuning test scale and timeouts
commit fb6a70c8d2b36c66d7b3ae732d9afd80b436a512
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:21:23Z
HADOOP-13560 scale tests take maven build arguments
commit d09aad6377fc37912d1c47355a191bc3279a4016
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:33:47Z
HADOOP-13567 S3AFileSystem to override getStorageStatistics() and so serve
up its statistics
commit e8afc25621e3552b80463084df29f785ecde6807
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:34:26Z
HADOOP-13566 NPE in S3AFastOutputStream.write
commit dfa90a08d18b7cda8c135ba8b838929a28784a47
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T14:31:39Z
HADOOP-13560 use STest as prefix for scale tests
commit 27365023e9363763c300e81bdefcb45887131ce4
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:06:51Z
HADOOP-13560 test improvements
commit a46781589ae8cedbdfeabb92fcc1ca83afc21b4c
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:07:29Z
HADOOP-13560 fix typo in the name of a statistic
commit cfdb0f0dbe8231a63046ba19900ea46645462bcb
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:08:42Z
HADOOP-13569 S3AFastOutputStream to take ProgressListener in file create()
commit 8ffd7a90fff7a5ed460b0396232d5322a06f8e59
Author: Steve Loughran <[email protected]>
Date: 2016-09-01T17:05:01Z
HADOOP-13560 lots of improvements to tests and to monitoring of what is going on
inside S3A, including a gauge of active request counts, plus more troubleshooting
docs. The fast output stream will retry on errors
commit 750e9462b7bd267915f9d91cbab0cd0ba51f1c41
Author: Steve Loughran <[email protected]>
Date: 2016-08-28T11:38:40Z
HADOOP-13531 S3 output stream allocator to round-robin directories
commit 51c27278bcfa067109efa702deed6890db677895
Author: Steve Loughran <[email protected]>
Date: 2016-09-05T17:48:36Z
HADOOP-13560 WiP: adding new incremental output stream
commit e1ce5a804a1c5d0afddf21362fe5a8d7d5179c58
Author: Steve Loughran <[email protected]>
Date: 2016-09-06T13:46:50Z
HADOOP-13560 data block design is coalescing and memory buffer writes are
passing tests
commit db1ed581b26c0320209017a09e77754638e7c42a
Author: Steve Loughran <[email protected]>
Date: 2016-09-06T19:58:14Z
HADOOP-13560 patch 002
block streaming is in, testing at moderate scale (<100 MB).
You can choose between buffer-by-RAM (the current fast uploader) and buffer-by-HDD;
in a test using SSD & remote S3 I got ~1.38 MB/s bandwidth, and something
similar, 1.44 MB/s, with RAM. But: we shouldn't run out of heap on the HDD option.
RAM buffering uses the existing ByteArrays, to ease source code migration off
FastUpload (which is still there, for now).
* I do plan to add pooled ByteBuffers
* Add metrics of total and ongoing upload, including tracking what quantity
of the outstanding block data has actually been uploaded.
commit a068598c5c89e46f98ab05deb23e43d38556e424
Author: Steve Loughran <[email protected]>
Date: 2016-09-07T14:12:11Z
HADOOP-13560 ongoing work on disk uploads at 2+ GB scale.
commit 9229c642a0380e6c8bb225e89d688fef1e9cb05c
Author: Steve Loughran <[email protected]>
Date: 2016-09-07T15:12:16Z
HADOOP-13560 complete merge with branch-2. Milestone: 1GB file round trip @
1.57 MB/s
----
> S3A to support huge file writes and operations -with tests
> ----------------------------------------------------------
>
> Key: HADOOP-13560
> URL: https://issues.apache.org/jira/browse/HADOOP-13560
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-13560-branch-2-001.patch,
> HADOOP-13560-branch-2-002.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really
> works.
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations for committers using rename.
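For context, here is a hypothetical sketch of the kind of large-rename check the issue description asks for, using only the public Hadoop `FileSystem` API. The paths are placeholders, the sketch assumes a large source object already exists, and the user-metadata verification (which needs the S3 client's object-metadata call) is not shown.

```java
// Sketch of a large-rename check: size preserved, source gone after rename.
// Placeholder paths; not the test code attached to this issue.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeRenameCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("s3a://example-bucket/tests/huge-src.bin");   // placeholder
    Path dst = new Path("s3a://example-bucket/tests/huge-dst.bin");   // placeholder
    FileSystem fs = src.getFileSystem(conf);

    long srcLen = fs.getFileStatus(src).getLen();   // size before the rename
    if (!fs.rename(src, dst)) {
      throw new AssertionError("rename returned false");
    }
    long dstLen = fs.getFileStatus(dst).getLen();
    if (dstLen != srcLen) {
      throw new AssertionError("length changed in rename: " + srcLen + " -> " + dstLen);
    }
    if (fs.exists(src)) {
      throw new AssertionError("source still present after rename");
    }
  }
}
```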