ASF GitHub Bot commented on HADOOP-13560:
Github user thodemoor commented on a diff in the pull request:
@@ -881,40 +881,362 @@ Seoul
If the wrong endpoint is used, the request may fail. This may be reported
as a 301/redirect error,
or as a 400 Bad Request.
- **Warning: NEW in hadoop 2.7. UNSTABLE, EXPERIMENTAL: use at own risk**
- <description>Upload directly from memory instead of buffering to
- disk first. Memory usage and parallelism can be controlled as up to
- fs.s3a.multipart.size memory is consumed for each (part)upload actively
- uploading (fs.s3a.threads.max) or queueing (fs.s3a.max.total.tasks)</description>
- <description>Size (in bytes) of initial memory buffer allocated for
- upload. No effect if fs.s3a.fast.upload is false.</description>
+### <a name="s3a_fast_upload"></a>Stabilizing: S3A Fast Upload
+**New in Hadoop 2.7; significantly enhanced in Hadoop 2.9**
+Because of the nature of the S3 object store, data written to an S3A
+`OutputStream` is not written incrementally. Instead, by default, it is
+buffered to disk until the stream's `close()` method is called.
+This can make output slow:
+* The execution time for `OutputStream.close()` is proportional to the amount
+of data buffered and inversely proportional to the bandwidth; that is,
+`O(data/bandwidth)`.
+* The bandwidth is that available from the host to S3: other work in the same
+process, server or network at the time of upload may increase the upload time,
+and hence the duration of the `close()` call.
+* If a process uploading data fails before `OutputStream.close()` is called,
+all data is lost.
+* The disks hosting the temporary directories defined in `fs.s3a.buffer.dir` must
+have the capacity to store the entire buffered file.
+Put succinctly: the further the process is from the S3 endpoint, or the
+smaller the EC2-hosted VM is, the longer it will take the work to complete.
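+For example, a stream with 1GB of buffered data and an effective upload
+bandwidth of 10MB/s will spend roughly 100 seconds blocked in `close()`
+(the numbers here are illustrative only).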
+This can create problems in application code:
+* Code often assumes that the `close()` call is fast;
+ the delays can create bottlenecks in operations.
+* Very slow uploads sometimes cause applications to time out (generally,
+threads blocking during the upload stop reporting progress, and so trigger
+timeouts).
+* Streaming very large amounts of data may consume all disk space before the
+upload begins.
+Work to address this began in Hadoop 2.7 with the `S3AFastOutputStream`, and
+has continued with `S3ABlockOutputStream`.
+This adds an alternative output stream, "S3A fast upload", which:
+1. Always uploads large files as blocks with the size set by
+   `fs.s3a.multipart.size`. That is: the threshold at which multipart uploads
+   begin and the size of each upload are identical.
+1. Buffers blocks to disk (default) or in on-heap or off-heap memory.
+1. Uploads blocks in parallel in background threads.
+1. Begins uploading blocks as soon as the buffered data exceeds this partition
+   size.
+1. When buffering data to disk, uses the directory/directories listed in
+   `fs.s3a.buffer.dir`. The size of data which can be buffered is limited
+   to the available disk space.
+1. Generates output statistics as metrics on the filesystem, including
+   statistics of active and pending block uploads.
+1. Has the time to `close()` set by the amount of remaining data to upload,
+   rather than the total size of the file.
+With incremental writes of blocks, "S3A fast upload" offers an upload
+time at least as fast as the "classic" mechanism, with significant benefits
+on long-lived output streams, and when very large amounts of data are generated.
+The in-memory buffering mechanisms may also offer a speedup when running close
+to S3 endpoints, as disks are not used for intermediate data storage.
+
+    <property>
+      <name>fs.s3a.fast.upload</name>
+      <value>true</value>
+      <description>
+        Use the incremental block upload mechanism with
+        the buffering mechanism set in fs.s3a.fast.upload.buffer.
+        The number of threads performing uploads in the filesystem is defined
+        by fs.s3a.threads.max; the queue of waiting uploads is limited by
+        fs.s3a.max.total.tasks.
+        The size of each buffer is set by fs.s3a.multipart.size.
+      </description>
+    </property>
+
+    <property>
+      <name>fs.s3a.fast.upload.buffer</name>
+      <value>disk</value>
+      <description>
+        The buffering mechanism to use when using S3A fast upload
+        (fs.s3a.fast.upload=true). Values: disk, array, bytebuffer.
+        This configuration option has no effect if fs.s3a.fast.upload is false.
+
+        "disk" will use the directories listed in fs.s3a.buffer.dir as
+        the location(s) to save data prior to being uploaded.
+
+        "array" uses arrays in the JVM heap.
+
+        "bytebuffer" uses off-heap memory within the JVM.
+
+        Both "array" and "bytebuffer" will consume memory in a single stream
+        up to the number of blocks set by:
+
+            fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks
+
+        If using either of these mechanisms, keep this value low.
+
+        The total number of threads performing work across all streams is set
+        by fs.s3a.threads.max, with fs.s3a.max.total.tasks setting the number
+        of queued work items.
+      </description>
+    </property>
+
+    <property>
+      <name>fs.s3a.multipart.size</name>
+      <value>104857600</value>
+      <description>
+        How big (in bytes) to split upload or copy operations up into.
+      </description>
+    </property>
+
+    <property>
+      <name>fs.s3a.fast.upload.active.blocks</name>
+      <value>8</value>
+      <description>
+        Maximum number of blocks a single output stream can have
+        active (uploading, or queued to the central FileSystem
+        instance's pool of queued operations).
+
+        This stops a single stream overloading the shared thread pool.
+      </description>
+    </property>
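+
+With the default values above, "array" or "bytebuffer" buffering may consume
+up to 104857600 * 8 bytes, roughly 800MB, per stream, multiplied by the number
+of streams open at once. As an illustration only (these values are examples,
+not recommendations), a memory-constrained deployment could lower both
+settings:
+
+    <property>
+      <name>fs.s3a.multipart.size</name>
+      <!-- illustrative: 16MB blocks instead of the 100MB default -->
+      <value>16777216</value>
+    </property>
+
+    <property>
+      <name>fs.s3a.fast.upload.active.blocks</name>
+      <!-- illustrative: at most 2 active blocks, bounding each stream
+           to about 32MB of buffered data -->
+      <value>2</value>
+    </property>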
+* If the amount of data written to a stream is below that set in
+`fs.s3a.multipart.size`, the upload is performed in the `OutputStream.close()`
+operation, as with the original output stream.
+* The published Hadoop metrics include live queue length and upload
+operation counts, making it possible to identify when there is a backlog of
+work, or a mismatch between data generation rates and network bandwidth.
+Per-stream statistics can also be logged by calling `toString()` on the
+current stream.
+* Incremental writes are not visible; the object can only be listed
+or read when the multipart operation completes in the `close()` call, which
+will block until the upload is completed.
+#### <a name="s3a_fast_upload_disk"></a>Fast Upload with Disk Buffers
+When `fs.s3a.fast.upload.buffer` is set to `disk`, all data is buffered
+to local hard disks prior to upload. This minimizes the amount of memory
+consumed, and so eliminates heap size as the limiting factor in queued uploads,
+exactly as the original "direct to disk" buffering does when
+`fs.s3a.fast.upload` is false.
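+
+An illustrative configuration for disk buffering (the buffer directory shown
+is an example path, not a default; fs.s3a.buffer.dir normally derives from
+hadoop.tmp.dir):
+
+    <property>
+      <name>fs.s3a.fast.upload</name>
+      <value>true</value>
+    </property>
+
+    <property>
+      <name>fs.s3a.fast.upload.buffer</name>
+      <value>disk</value>
+    </property>
+
+    <property>
+      <name>fs.s3a.buffer.dir</name>
+      <!-- example location only -->
+      <value>/tmp/hadoop-s3a</value>
+    </property>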
+#### <a name="s3a_fast_upload_bytebuffer"></a>Fast Upload with ByteBuffers
+When `fs.s3a.fast.upload.buffer` is set to `bytebuffer`, all data is buffered
+in "direct" ByteBuffers prior to upload. This *may* be faster than buffering
+to disk and, if disk space is small (for example, on tiny EC2 VMs), there may
+not be much disk space to buffer with.
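+
+An illustrative configuration for this mode, using the values documented
+above:
+
+    <property>
+      <name>fs.s3a.fast.upload</name>
+      <value>true</value>
+    </property>
+
+    <property>
+      <name>fs.s3a.fast.upload.buffer</name>
+      <value>bytebuffer</value>
+    </property>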
+The ByteBuffers are created in the memory of the JVM, but not in the Java heap
+itself. The amount of data which can be buffered is limited by the Java
+runtime, the operating system, and, for YARN applications, the amount of
+memory requested for each container.
+The slower the write bandwidth to S3, the greater the risk of running out
+of memory.
--- End diff --
Memory usage is bounded to ...
> S3ABlockOutputStream to support huge (many GB) file writes
> Key: HADOOP-13560
> URL: https://issues.apache.org/jira/browse/HADOOP-13560
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13560-branch-2-001.patch,
> HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch,
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really works.
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations for committers using rename