[
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473510#comment-15473510
]
ASF GitHub Bot commented on HADOOP-13560:
-----------------------------------------
GitHub user steveloughran opened a pull request:
https://github.com/apache/hadoop/pull/125
HADOOP-13560 S3A to support huge file writes and operations -with tests
Adds
## Scale tests for S3A huge file support
- always running at the MB size (maybe best to make this optional)
- configurable to bigger sizes in the auth-keys XML or in the build:
`-Dfs.s3a.scale.test.huge.filesize=1000` (see the sketch after this list)
- limited to upload, seek, read, rename, delete. The JUnit test cases are
explicitly set up to run in order here.
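As an illustration of how a scale test might pick up that configured size, here is a minimal Java sketch. Only the property name `fs.s3a.scale.test.huge.filesize` comes from the build line above; the class, helper and default names are hypothetical, and treating the value as megabytes is an assumption made for the example, not something taken from the patch.

```java
// Sketch only: how a JUnit scale test might resolve the huge-file size from the
// maven build property named in the PR description. Names here are illustrative.
import org.apache.hadoop.conf.Configuration;

public class HugeFileSizeExample {
  // Property named in the PR: -Dfs.s3a.scale.test.huge.filesize=1000
  static final String KEY_HUGE_FILESIZE = "fs.s3a.scale.test.huge.filesize";
  // "always running at the MB size": small default when nothing is configured.
  // Interpreting the value as megabytes is an assumption of this sketch.
  static final int DEFAULT_HUGE_FILESIZE_MB = 1;

  /** Resolve the test file size in megabytes, falling back to the small default. */
  static int getHugeFilesizeMB(Configuration conf) {
    return conf.getInt(KEY_HUGE_FILESIZE, DEFAULT_HUGE_FILESIZE_MB);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The maven build can hand the -D value down as a system property.
    String fromBuild = System.getProperty(KEY_HUGE_FILESIZE);
    if (fromBuild != null) {
      conf.setInt(KEY_HUGE_FILESIZE, Integer.parseInt(fromBuild));
    }
    System.out.println("huge file size (MB): " + getHugeFilesizeMB(conf));
  }
}
```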
## New scalable output stream for writing, `S3ABlockOutputStream`
- always saves in incremental blocks as writes proceed, with block size ==
partition size (a minimal sketch of the idea follows this list)
- supports the Fast output stream's memory buffer code (for regression testing)
- supports a back end which buffers blocks in files, using round-robin disk
allocation. As such, write/read bandwidth is limited to aggregate HDD bandwidth.
- adds extra failure resilience as testing throws up failure conditions
(network timeouts, no response from the server on multipart commit, etc.)
- adds instrumentation, including callbacks from the AWS SDK to update
gauges and counters (in progress)
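To make the incremental-block idea concrete, here is a minimal, hypothetical sketch of a block-buffering output stream: bytes accumulate in a block sized to the multipart partition, and each full block is handed off for upload while the caller keeps writing. The class and the `uploadBlock()`/`completeUpload()` hooks are illustrative stand-ins, not the code in the patch.

```java
// Illustration of the incremental-block design, not the patch itself.
import java.io.IOException;
import java.io.OutputStream;

public abstract class BlockingUploadOutputStream extends OutputStream {
  private final int blockSize;   // == multipart partition size
  private byte[] block;
  private int used;

  protected BlockingUploadOutputStream(int blockSize) {
    this.blockSize = blockSize;
    this.block = new byte[blockSize];
  }

  @Override
  public void write(int b) throws IOException {
    block[used++] = (byte) b;
    if (used == blockSize) {
      uploadBlock(block, used);     // upload this partition, then start a new one
      block = new byte[blockSize];
      used = 0;
    }
  }

  @Override
  public void close() throws IOException {
    if (used > 0) {
      uploadBlock(block, used);     // final, possibly short, partition
    }
    completeUpload();               // e.g. complete the multipart upload
  }

  /** Upload one block; a real back end would buffer it in RAM or in a local file. */
  protected abstract void uploadBlock(byte[] data, int length) throws IOException;

  /** Commit the whole object once every block has been uploaded. */
  protected abstract void completeUpload() throws IOException;
}
```

In the disk-backed variant described above, `uploadBlock()` would stream from a local file chosen by round-robin allocation across the configured buffer directories, rather than from an in-memory array.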
What we have here is essentially something that can replace the classic
"save to file, upload at the end" stream and the fast "store it all in RAM and
hope there's space" stream. It should offer incremental upload for faster
output of larger files compared to the classic file stream, with the scalability
the fast one lacks, and the instrumentation to show what's happening (see the
progress-callback sketch below).
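On the instrumentation point, the AWS SDK's `com.amazonaws.event.ProgressListener` is the callback mechanism referred to above. The sketch below shows one hypothetical way such callbacks could feed a bytes-uploaded counter; the counter class and its wiring are illustrative, not the S3A instrumentation in the patch.

```java
// Hedged illustration of "callbacks from the AWS SDK": a ProgressListener that
// accumulates bytes transferred. Only the listener interface is the SDK's own;
// the counter and its use are assumptions for the example.
import java.util.concurrent.atomic.AtomicLong;

import com.amazonaws.event.ProgressEvent;
import com.amazonaws.event.ProgressListener;

public class CountingProgressListener implements ProgressListener {
  private final AtomicLong bytesUploaded = new AtomicLong();

  @Override
  public void progressChanged(ProgressEvent progressEvent) {
    long transferred = progressEvent.getBytesTransferred();
    if (transferred > 0) {
      // In S3A this is where a gauge/counter in the filesystem instrumentation
      // would be updated; here we just keep a running total.
      bytesUploaded.addAndGet(transferred);
    }
  }

  public long getBytesUploaded() {
    return bytesUploaded.get();
  }
}
```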
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/hadoop s3/HADOOP-13560-5GB-blobs
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/hadoop/pull/125.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #125
----
commit d21600ead50aaafed1611b00206991c7d2c5934f
Author: Steve Loughran <[email protected]>
Date: 2016-08-30T15:48:14Z
HADOOP-13560 adding test for create/copy 5GB files
commit 13b0544fffe7feb3e6d7404c90f222f1ae6644bb
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T11:29:28Z
HADOOP-13560 tuning test scale and timeouts
commit fb6a70c8d2b36c66d7b3ae732d9afd80b436a512
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:21:23Z
HADOOP-13560 scale tests take maven build arguments
commit d09aad6377fc37912d1c47355a191bc3279a4016
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:33:47Z
HADOOP-13567 S3AFileSystem to override getStorageStatistics() and so serve
up its statistics
commit e8afc25621e3552b80463084df29f785ecde6807
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T13:34:26Z
HADOOP-13566 NPE in S3AFastOutputStream.write
commit dfa90a08d18b7cda8c135ba8b838929a28784a47
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T14:31:39Z
HADOOP-13560 use STest as prefix for scale tests
commit 27365023e9363763c300e81bdefcb45887131ce4
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:06:51Z
HADOOP-13560 test improvements
commit a46781589ae8cedbdfeabb92fcc1ca83afc21b4c
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:07:29Z
HADOOP-13560 fix typo in the name of a statistic
commit cfdb0f0dbe8231a63046ba19900ea46645462bcb
Author: Steve Loughran <[email protected]>
Date: 2016-08-31T17:08:42Z
HADOOP-13569 S3AFastOutputStream to take ProgressListener in file create()
commit 8ffd7a90fff7a5ed460b0396232d5322a06f8e59
Author: Steve Loughran <[email protected]>
Date: 2016-09-01T17:05:01Z
HADOOP-13560 lots of improvements to tests and to monitoring of what is going on
inside S3A, including a gauge of active request counts, plus more troubleshooting
docs. The fast output stream will retry on errors
commit 750e9462b7bd267915f9d91cbab0cd0ba51f1c41
Author: Steve Loughran <[email protected]>
Date: 2016-08-28T11:38:40Z
HADOOP-13531 S3 output stream allocator to round-robin directories
commit 51c27278bcfa067109efa702deed6890db677895
Author: Steve Loughran <[email protected]>
Date: 2016-09-05T17:48:36Z
HADOOP-13560 WiP: adding new incremental output stream
commit e1ce5a804a1c5d0afddf21362fe5a8d7d5179c58
Author: Steve Loughran <[email protected]>
Date: 2016-09-06T13:46:50Z
HADOOP-13560 data block design is coalescing and memory buffer writes are
passing tests
commit db1ed581b26c0320209017a09e77754638e7c42a
Author: Steve Loughran <[email protected]>
Date: 2016-09-06T19:58:14Z
HADOOP-13560 patch 002
block streaming is in, testing at moderate scale (<100 MB).
You can choose between buffer-by-RAM (the current fast uploader) and buffer-by-HDD;
in a test using SSD & remote S3 I got ~1.38 MB/s bandwidth, and something
similar, 1.44 MB/s, with RAM. But: we shouldn't run out of heap on the HDD option.
RAM buffering uses the existing ByteArrays, to ease source code migration off
FastUpload (which is still there, for now).
* I do plan to add pooled ByteBuffers
* Add metrics of total and ongoing upload, including tracking what quantity
of the outstanding block data has actually been uploaded.
commit a068598c5c89e46f98ab05deb23e43d38556e424
Author: Steve Loughran <[email protected]>
Date: 2016-09-07T14:12:11Z
HADOOP-13560 ongoing work on disk uploads at 2+ GB scale.
commit 9229c642a0380e6c8bb225e89d688fef1e9cb05c
Author: Steve Loughran <[email protected]>
Date: 2016-09-07T15:12:16Z
HADOOP-13560 complete merge with branch-2. Milestone: 1GB file round trip @
1.57 MB/s
----
> S3A to support huge file writes and operations -with tests
> ----------------------------------------------------------
>
> Key: HADOOP-13560
> URL: https://issues.apache.org/jira/browse/HADOOP-13560
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-13560-branch-2-001.patch,
> HADOOP-13560-branch-2-002.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really
> works.
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations for committers using rename.
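For context, here is a hypothetical sketch of the kind of large-rename check the issue description asks for, using only the public Hadoop `FileSystem` API. The paths are placeholders, the sketch assumes a large source object already exists, and the user-metadata verification (which needs the S3 client's object-metadata call) is not shown.

```java
// Sketch of a large-rename check: size preserved, source gone after rename.
// Placeholder paths; not the test code attached to this issue.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeRenameCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("s3a://example-bucket/tests/huge-src.bin");   // placeholder
    Path dst = new Path("s3a://example-bucket/tests/huge-dst.bin");   // placeholder
    FileSystem fs = src.getFileSystem(conf);

    long srcLen = fs.getFileStatus(src).getLen();   // size before the rename
    if (!fs.rename(src, dst)) {
      throw new AssertionError("rename returned false");
    }
    long dstLen = fs.getFileStatus(dst).getLen();
    if (dstLen != srcLen) {
      throw new AssertionError("length changed in rename: " + srcLen + " -> " + dstLen);
    }
    if (fs.exists(src)) {
      throw new AssertionError("source still present after rename");
    }
  }
}
```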