[ https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574770#comment-15574770 ]
ASF GitHub Bot commented on HADOOP-13560:
-----------------------------------------

Github user thodemoor commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/130#discussion_r83385370

--- Diff: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md ---
@@ -881,40 +881,362 @@ Seoul

If the wrong endpoint is used, the request may fail. This may be reported as a
301/redirect error, or as a 400 Bad Request.

-### S3AFastOutputStream
-
-**Warning: NEW in hadoop 2.7. UNSTABLE, EXPERIMENTAL: use at own risk**
-
-    <property>
-      <name>fs.s3a.fast.upload</name>
-      <value>false</value>
-      <description>Upload directly from memory instead of buffering to
-      disk first. Memory usage and parallelism can be controlled as up to
-      fs.s3a.multipart.size memory is consumed for each (part)upload actively
-      uploading (fs.s3a.threads.max) or queueing (fs.s3a.max.total.tasks)</description>
-    </property>
-
-    <property>
-      <name>fs.s3a.fast.buffer.size</name>
-      <value>1048576</value>
-      <description>Size (in bytes) of initial memory buffer allocated for an
-      upload. No effect if fs.s3a.fast.upload is false.</description>
-    </property>

### <a name="s3a_fast_upload"></a>Stabilizing: S3A Fast Upload

**New in Hadoop 2.7; significantly enhanced in Hadoop 2.9**

Because of the nature of the S3 object store, data written to an S3A `OutputStream`
is not written incrementally; instead, by default, it is buffered to disk
until the stream is closed in its `close()` method.

This can make output slow:

* The execution time for `OutputStream.close()` is proportional to the amount of data
buffered and inversely proportional to the bandwidth: that is, `O(data/bandwidth)`.
* The bandwidth is that available from the host to S3: other work in the same
process, server or network at the time of upload may increase the upload time,
and hence the duration of the `close()` call.
* If a process uploading data fails before `OutputStream.close()` is called,
all data is lost.
* The disks hosting the temporary directories defined in `fs.s3a.buffer.dir` must
have the capacity to store the entire buffered file.

Put succinctly: the further the process is from the S3 endpoint, or the smaller
the EC2-hosted VM is, the longer the work will take to complete.

This can create problems in application code:

* Code often assumes that the `close()` call is fast;
the delays can create bottlenecks in operations.
* Very slow uploads sometimes cause applications to time out (generally,
threads blocking during the upload stop reporting progress, and so trigger timeouts).
* Streaming very large amounts of data may consume all disk space before the upload begins.

Work to address this began in Hadoop 2.7 with the `S3AFastOutputStream`
[HADOOP-11183](https://issues.apache.org/jira/browse/HADOOP-11183), and
has continued with `S3ABlockOutputStream`
[HADOOP-13560](https://issues.apache.org/jira/browse/HADOOP-13560).

This adds an alternative output stream, "S3A Fast Upload", which:

1. Always uploads large files as blocks with the size set by
   `fs.s3a.multipart.size`. That is: the threshold at which multipart uploads
   begin and the size of each upload are identical.
1. Buffers blocks to disk (the default), or in on-heap or off-heap memory.
1. Uploads blocks in parallel in background threads.
1. Begins uploading blocks as soon as the buffered data exceeds this partition
   size.
1. When buffering data to disk, uses the directory/directories listed in
   `fs.s3a.buffer.dir`.
   The size of data which can be buffered is limited
   to the available disk space.
1. Generates output statistics as metrics on the filesystem, including
   statistics of active and pending block uploads.
1. Has the time to `close()` set by the amount of remaining data to upload, rather
   than the total size of the file.

With incremental writes of blocks, "S3A fast upload" offers an upload
time at least as fast as the "classic" mechanism, with significant benefits
on long-lived output streams and when very large amounts of data are generated.
The in-memory buffering mechanisms may also offer a speedup when running adjacent to
S3 endpoints, as disks are not used for intermediate data storage.

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
  <description>
    Use the incremental block upload mechanism with
    the buffering mechanism set in fs.s3a.fast.upload.buffer.
    The number of threads performing uploads in the filesystem is defined
    by fs.s3a.threads.max; the queue of waiting uploads is limited by
    fs.s3a.max.total.tasks.
    The size of each buffer is set by fs.s3a.multipart.size.
  </description>
</property>

<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
  <description>
    The buffering mechanism to use when using S3A fast upload
    (fs.s3a.fast.upload=true). Values: disk, array, bytebuffer.
    This configuration option has no effect if fs.s3a.fast.upload is false.

    "disk" will use the directories listed in fs.s3a.buffer.dir as
    the location(s) to save data prior to being uploaded.

    "array" uses arrays in the JVM heap.

    "bytebuffer" uses off-heap memory within the JVM.

    Both "array" and "bytebuffer" will consume memory in a single stream
    up to the amount given by:

    fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.

    If using either of these mechanisms, keep these values low.

    The total number of threads performing work across all streams is set by
    fs.s3a.threads.max, with fs.s3a.max.total.tasks setting the number of
    queued work items.
  </description>
</property>

<property>
  <name>fs.s3a.multipart.size</name>
  <value>104857600</value>
  <description>
    How big (in bytes) to split upload or copy operations up into.
  </description>
</property>

<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>8</value>
  <description>
    Maximum number of blocks a single output stream can have
    active (uploading, or queued to the central FileSystem
    instance's pool of queued operations).

    This stops a single stream overloading the shared thread pool.
  </description>
</property>
```

**Notes**

* If the amount of data written to a stream is below that set in `fs.s3a.multipart.size`,
the upload is performed in the `OutputStream.close()` operation, as with
the original output stream.

* The published Hadoop metrics include live queue length and
upload operation counts, making it possible to identify when there is a backlog
of work, or a mismatch between data generation rates and network bandwidth.
Per-stream statistics can also be logged by calling `toString()` on the current stream.

* Incremental writes are not visible; the object can only be listed
or read when the multipart operation completes in the `close()` call, which
will block until the upload is completed.
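
To see how these options fit together, here is a minimal, hypothetical Java
sketch of writing a large object with fast upload enabled. The bucket and path
names are placeholders, and it assumes a Hadoop 2.9 client with S3 credentials
already configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AFastUploadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Switch to the incremental block upload mechanism.
    conf.setBoolean("fs.s3a.fast.upload", true);
    // Buffer blocks on local disk; "array"/"bytebuffer" select in-memory buffering.
    conf.set("fs.s3a.fast.upload.buffer", "disk");
    // 100 MB blocks: an upload starts once this much data has been buffered.
    conf.setLong("fs.s3a.multipart.size", 100 * 1024 * 1024);

    // Placeholder bucket/path; credentials must already be set up.
    Path dest = new Path("s3a://example-bucket/data/large-output.bin");
    FileSystem fs = dest.getFileSystem(conf);

    byte[] chunk = new byte[8 * 1024 * 1024];
    try (FSDataOutputStream out = fs.create(dest, true)) {
      for (int i = 0; i < 32; i++) {   // ~256 MB: enough to trigger block uploads
        out.write(chunk);
      }
      // Per-stream statistics (e.g. active/pending block uploads) can be
      // inspected by logging the wrapped stream's toString() value.
      System.out.println(out.getWrappedStream());
    } // close() blocks only until the *remaining* buffered data is uploaded.
  }
}
```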
#### <a name="s3a_fast_upload_disk"></a>Fast Upload with Disk Buffers: `fs.s3a.fast.upload.buffer=disk`

When `fs.s3a.fast.upload.buffer` is set to `disk`, all data is buffered
to local hard disks prior to upload. This minimizes the amount of memory
consumed, and so eliminates heap size as the limiting factor in queued uploads,
exactly as the original "direct to disk" buffering used when
`fs.s3a.fast.upload=false`.

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
```

#### <a name="s3a_fast_upload_bytebuffer"></a>Fast Upload with ByteBuffers: `fs.s3a.fast.upload.buffer=bytebuffer`

When `fs.s3a.fast.upload.buffer` is set to `bytebuffer`, all data is buffered
in "direct" ByteBuffers prior to upload. This *may* be faster than buffering to
disk, and is useful when there is little disk space to buffer with (for example,
on tiny EC2 VMs).

The ByteBuffers are created in the memory of the JVM, but not in the Java heap itself.
The amount of data which can be buffered is
limited by the Java runtime, the operating system, and, for YARN applications,
the amount of memory requested for each container.

The slower the write bandwidth to S3, the greater the risk of running out
of memory.

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>bytebuffer</value>
</property>
```

#### <a name="s3a_fast_upload_array"></a>Fast Upload with Arrays: `fs.s3a.fast.upload.buffer=array`

When `fs.s3a.fast.upload.buffer` is set to `array`, all data is buffered
in byte arrays in the JVM's heap prior to upload.
This *may* be faster than buffering to disk.

This `array` option is similar to the in-memory-only stream offered in
Hadoop 2.7 with `fs.s3a.fast.upload=true`.

The amount of data which can be buffered is limited by the available
size of the JVM heap. The slower the write bandwidth to S3, the greater
the risk of heap overflow.

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>array</value>
</property>
```

#### <a name="s3a_fast_upload_thread_tuning"></a>S3A Fast Upload Thread Tuning

Both the [Array](#s3a_fast_upload_array) and [Byte buffer](#s3a_fast_upload_bytebuffer)
buffer mechanisms can consume very large amounts of memory, on-heap or
off-heap respectively. The [disk buffer](#s3a_fast_upload_disk) mechanism
does not use much memory, but will consume hard disk capacity.

If there are many output streams being written to in a single process, the
total memory or disk used is the sum of every stream's active memory/disk use.

Careful tuning may be needed to reduce the risk of running out of memory,
especially if the data is buffered in memory.

There are a number of parameters which can be tuned:

1. The total number of threads available in the filesystem for data
uploads *or any other queued filesystem operation*. This is set in
`fs.s3a.threads.max`.

1. The number of operations which can be queued for execution, *awaiting
a thread*: `fs.s3a.max.total.tasks`.

1. The number of blocks which a single output stream can have active,
that is: being uploaded by a thread, or queued in the filesystem thread queue:
`fs.s3a.fast.upload.active.blocks`.

1. How long an idle thread can stay in the thread pool before it is retired:
`fs.s3a.threads.keepalivetime`.

When the maximum allowed number of active blocks of a single stream is reached,
no more blocks can be uploaded from that stream until one or more of those active
blocks' uploads completes. That is: a `write()` call which would trigger an upload
of a now-full datablock will instead block until there is capacity in the queue.

How does that come together?

* As the pool of threads set in `fs.s3a.threads.max` is shared (and intended
to be used across all streams), a larger number here can allow for more
parallel operations. However, as uploads require network bandwidth, adding more
threads does not guarantee speedup.

* The extra queue of tasks for the thread pool (`fs.s3a.max.total.tasks`)
covers all ongoing background S3A operations (future plans include parallelized
rename operations and asynchronous directory operations).

* When using memory buffering, a small value of `fs.s3a.fast.upload.active.blocks`
limits the amount of memory which can be consumed per stream.

* When using disk buffering, a larger value of `fs.s3a.fast.upload.active.blocks`
does not consume much memory, but it may result in a large number of blocks
competing with other filesystem operations.

We recommend a low value of `fs.s3a.fast.upload.active.blocks`: enough
to start background upload without overloading other parts of the system,
then experiment to see if higher values deliver more throughput, especially
from VMs running on EC2.

```xml
<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>4</value>
  <description>
    Maximum number of blocks a single output stream can have
    active (uploading, or queued to the central FileSystem
    instance's pool of queued operations).

    This stops a single stream overloading the shared thread pool.
  </description>
</property>

<property>
  <name>fs.s3a.threads.max</name>
  <value>10</value>
  <description>The total number of threads available in the filesystem for data
    uploads *or any other queued filesystem operation*.</description>
</property>

<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>5</value>
  <description>The number of operations which can be queued for
    execution.</description>
</property>

<property>
  <name>fs.s3a.threads.keepalivetime</name>
  <value>60</value>
  <description>Number of seconds a thread can be idle before being
    terminated.</description>
</property>
```

#### <a name="s3a_multipart_purge"></a>Cleaning up After Incremental Upload Failures: `fs.s3a.multipart.purge`

If an incremental streaming operation is interrupted, there may be
intermediate partitions uploaded to S3, data which will be billed for.

These charges can be reduced by enabling `fs.s3a.multipart.purge`
and setting a purge time in seconds, such as 86400 seconds (24 hours), after
which the S3 service automatically deletes outstanding multipart
--- End diff --

To me, the wording here gives the impression this is a server-side operation, but
the purging happens on the client by listing all uploads and then sending a delete
call for the ones to be purged. Consequently, this can cause a (slight) delay when
instantiating an S3A FS instance when there are lots of active uploads (to purge).
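
For illustration, the client-side purge described in the comment can be
approximated with the AWS SDK for Java v1 directly. This is a hypothetical
sketch, not the actual S3A code path; the bucket name is a placeholder and
credentials are assumed to come from the default provider chain.

```java
import java.util.Date;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class MultipartPurgeSketch {
  public static void main(String[] args) {
    // Placeholder bucket name; credentials from the default provider chain.
    String bucket = "example-bucket";
    // Equivalent of a 24-hour purge age, in seconds.
    long purgeAgeSeconds = 86400L;

    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    TransferManager transfers = TransferManagerBuilder.standard()
        .withS3Client(s3)
        .build();

    // Abort every outstanding multipart upload in the bucket that started
    // before the cut-off date. Internally this lists the uploads and issues
    // one abort request per upload; all of it is client-side work, which is
    // why it can delay filesystem instantiation when many uploads are pending.
    Date cutoff = new Date(System.currentTimeMillis() - purgeAgeSeconds * 1000);
    transfers.abortMultipartUploads(bucket, cutoff);

    transfers.shutdownNow();
  }
}
```

Since the listing and the per-upload abort calls all happen in the client, the
cost of this step grows with the number of outstanding uploads, which is the
instantiation delay the comment describes.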
> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
>                 Key: HADOOP-13560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13560
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13560-branch-2-001.patch,
>                      HADOOP-13560-branch-2-002.patch,
>                      HADOOP-13560-branch-2-003.patch,
>                      HADOOP-13560-branch-2-004.patch
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights
> that metadata isn't copied on large copies.
>
> 1. Add a test to do that large copy/rename and verify that the copy really works.
> 2. Verify that metadata makes it over.
>
> Verifying large file rename is important on its own, as it is needed for very
> large commit operations for committers using rename.