[ 
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575322#comment-15575322
 ] 

ASF GitHub Bot commented on HADOOP-13560:
-----------------------------------------

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/130#discussion_r83420236
  
    --- Diff: 
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md ---
    @@ -881,40 +881,362 @@ Seoul
     If the wrong endpoint is used, the request may fail. This may be reported 
as a 301/redirect error,
     or as a 400 Bad Request.
     
    -### S3AFastOutputStream
    - **Warning: NEW in hadoop 2.7. UNSTABLE, EXPERIMENTAL: use at own risk**
     
    -    <property>
    -      <name>fs.s3a.fast.upload</name>
    -      <value>false</value>
    -      <description>Upload directly from memory instead of buffering to
    -      disk first. Memory usage and parallelism can be controlled as up to
    -      fs.s3a.multipart.size memory is consumed for each (part)upload 
actively
    -      uploading (fs.s3a.threads.max) or queueing 
(fs.s3a.max.total.tasks)</description>
    -    </property>
     
    -    <property>
    -      <name>fs.s3a.fast.buffer.size</name>
    -      <value>1048576</value>
    -      <description>Size (in bytes) of initial memory buffer allocated for 
an
    -      upload. No effect if fs.s3a.fast.upload is false.</description>
    -    </property>
    +### <a name="s3a_fast_upload"></a>Stabilizing: S3A Fast Upload
    +
    +
    +**New in Hadoop 2.7; significantly enhanced in Hadoop 2.9**
    +
    +
    +Because of the nature of the S3 object store, data written to an S3A 
`OutputStream` 
    +is not written incrementally —instead, by default, it is buffered to disk
    +until the stream is closed in its `close()` method. 
    +
    +This can make output slow:
    +
    +* The execution time for `OutputStream.close()` is proportional to the 
amount of data
    +buffered and inversely proportional to the bandwidth. That is 
`O(data/bandwidth)`.
    +* The bandwidth is that available from the host to S3: other work in the 
same
    +process, server or network at the time of upload may increase the upload 
time,
    +hence the duration of the `close()` call.
    +* If a process uploading data fails before `OutputStream.close()` is 
called,
    +all data is lost.
    +* The disks hosting temporary directories defined in `fs.s3a.buffer.dir` 
must
    +have the capacity to store the entire buffered file.
    +
    +Put succinctly: the further the process is from the S3 endpoint, or the 
smaller
    +the EC-hosted VM is, the longer it will take work to complete.
    +
    +This can create problems in application code:
    +
    +* Code often assumes that the `close()` call is fast;
    + the delays can create bottlenecks in operations.
    +* Very slow uploads sometimes cause applications to time out. (generally,
    +threads blocking during the upload stop reporting progress, so trigger 
timeouts)
    +* Streaming very large amounts of data may consume all disk space before 
the upload begins.
    +
    +
    +Work to addess this began in Hadoop 2.7 with the `S3AFastOutputStream`
    +[HADOOP-11183](https://issues.apache.org/jira/browse/HADOOP-11183), and
    +has continued with ` S3ABlockOutputStream`
    +[HADOOP-13560](https://issues.apache.org/jira/browse/HADOOP-13560).
    +
    +
    +This adds an alternative output stream, "S3a Fast Upload" which:
    +
    +1.  Always uploads large files as blocks with the size set by
    +    `fs.s3a.multipart.size`. That is: the threshold at which multipart 
uploads
    +    begin and the size of each upload are identical.
    +1.  Buffers blocks to disk (default) or in on-heap or off-heap memory.
    +1.  Uploads blocks in parallel in background threads.
    +1.  Begins uploading blocks as soon as the buffered data exceeds this 
partition
    +    size.
    +1.  When buffering data to disk, uses the directory/directories listed in
    +    `fs.s3a.buffer.dir`. The size of data which can be buffered is limited
    +    to the available disk space.
    +1.  Generates output statistics as metrics on the filesystem, including
    +    statistics of active and pending block uploads.
    +1.  Has the time to `close()` set by the amount of remaning data to 
upload, rather
    +    than the total size of the file.
    +
    +With incremental writes of blocks, "S3A fast upload" offers an upload
    +time at least as fast as the "classic" mechanism, with significant benefits
    +on long-lived output streams, and when very large amounts of data are 
generated.
    +The in memory buffering mechanims may also  offer speedup when running 
adjacent to
    +S3 endpoints, as disks are not used for intermediate data storage.
    +
    +
    +```xml
    +<property>
    +  <name>fs.s3a.fast.upload</name>
    +  <value>true</value>
    +  <description>
    +    Use the incremental block upload mechanism with
    +    the buffering mechanism set in fs.s3a.fast.upload.buffer.
    +    The number of threads performing uploads in the filesystem is defined
    +    by fs.s3a.threads.max; the queue of waiting uploads limited by
    +    fs.s3a.max.total.tasks.
    +    The size of each buffer is set by fs.s3a.multipart.size.
    +  </description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.fast.upload.buffer</name>
    +  <value>disk</value>
    +  <description>
    +    The buffering mechanism to use when using S3A fast upload
    +    (fs.s3a.fast.upload=true). Values: disk, array, bytebuffer.
    +    This configuration option has no effect if fs.s3a.fast.upload is false.
    +
    +    "disk" will use the directories listed in fs.s3a.buffer.dir as
    +    the location(s) to save data prior to being uploaded.
    +
    +    "array" uses arrays in the JVM heap
    +
    +    "bytebuffer" uses off-heap memory within the JVM.
    +
    +    Both "array" and "bytebuffer" will consume memory in a single stream 
up to the number
    +    of blocks set by:
    +
    +        fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
    +
    +    If using either of these mechanisms, keep this value low
    +
    +    The total number of threads performing work across all threads is set 
by
    +    fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the 
number of queued
    +    work items.
    +  </description>
    +</property>
    +  
    +<property>
    +  <name>fs.s3a.multipart.size</name>
    +  <value>104857600</value>
    +  <description>
    +  How big (in bytes) to split upload or copy operations up into.
    +  </description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.fast.upload.active.blocks</name>
    +  <value>8</value>
    +  <description>
    +    Maximum Number of blocks a single output stream can have
    +    active (uploading, or queued to the central FileSystem
    +    instance's pool of queued operations.
    +
    +    This stops a single stream overloading the shared thread pool.
    +  </description>
    +</property>
    +```
    +
    +**Notes**
    +
    +* If the amount of data written to a stream is below that set in 
`fs.s3a.multipart.size`,
    +the upload is performed in the `OutputStream.close()` operation —as with
    +the original output stream.
    +
    +* The published Hadoop metrics monitor include live queue length and
    +upload operation counts, so identifying when there is a backlog of work/
    +a mismatch between data generation rates and network bandwidth. Per-stream
    +statistics can also be logged by calling `toString()` on the current 
stream.
    +
    +* Incremental writes are not visible; the object can only be listed
    +or read when the multipart operation completes in the `close()` call, which
    +will block until the upload is completed.
    +
    +
    +#### <a name="s3a_fast_upload_disk"></a>Fast Upload with Disk Buffers 
`fs.s3a.fast.upload.buffer=disk`
    +
    +When `fs.s3a.fast.upload.buffer` is set to `disk`, all data is buffered
    +to local hard disks prior to upload. This minimizes the amount of memory
    +consumed, and so eliminates heap size as the limiting factor in queued 
uploads
    +—exactly as the original "direct to disk" buffering used when
    +`fs.s3a.fast.upload=false`.
    +
    +
    +```xml
    +<property>
    +  <name>fs.s3a.fast.upload</name>
    +  <value>true</value>
    +</property>
    +  
    +<property>
    +  <name>fs.s3a.fast.upload.buffer</name>
    +  <value>disk</value>
    +</property>
    +
    +```
    +
    +
    +#### <a name="s3a_fast_upload_bytebuffer"></a>Fast Upload with 
ByteBuffers: `fs.s3a.fast.upload.buffer=bytebuffer`
    +
    +When `fs.s3a.fast.upload.buffer` is set to `bytebuffer`, all data is 
buffered
    +in "Direct" ByteBuffers prior to upload. This *may* be faster than 
buffering to disk,
    +and, if disk space is small (for example, tiny EC2 VMs), there may not
    +be much disk space to buffer with.
    +
    +The ByteBuffers are created in the memory of the JVM, but not in the Java 
Heap itself.
    +The amount of data which can be buffered is
    +limited by the Java runtime, the operating system, and, for YARN 
applications,
    +the amount of memory requested for each container.
    +
    +The slower the write bandwidth to S3, the greater the risk of running out
    +of memory.
    +
    +
    +```xml
    +<property>
    +  <name>fs.s3a.fast.upload</name>
    +  <value>true</value>
    +</property>
    +  
    +<property>
    +  <name>fs.s3a.fast.upload.buffer</name>
    +  <value>bytebuffer</value>
    +</property>
    +```
    +
    +#### <a name="s3a_fast_upload_array"></a>Fast Upload with Arrays: 
`fs.s3a.fast.upload.buffer=array`
    +
    +When `fs.s3a.fast.upload.buffer` is set to `array`, all data is buffered
    +in byte arrays in the JVM's heap prior to upload.
    +This *may* be faster than buffering to disk.
    +
    +This `array` option is similar to the in-memory-only stream offered in
    +Hadoop 2.7 with `fs.s3a.fast.upload=true`
    +
    +The amount of data which can be buffered is limited by the available
    +size of the JVM heap heap. The slower the write bandwidth to S3, the 
greater
    +the risk of heap overflows.
    +
    +```xml
    +<property>
    +  <name>fs.s3a.fast.upload</name>
    +  <value>true</value>
    +</property>
    +  
    +<property>
    +  <name>fs.s3a.fast.upload.buffer</name>
    +  <value>array</value>
    +</property>
    +
    +```
    +#### <a name="s3a_fast_upload_thread_tuning"></a>S3A Fast Upload Thread 
Tuning
    +
    +Both the [Array](#s3a_fast_upload_array) and [Byte 
buffer](#s3a_fast_upload_bytebuffer)
    +buffer mechanisms can consume very large amounts of memory, on-heap or
    +off-heap respectively. The [disk buffer](#s3a_fast_upload_disk) mechanism
    +does not use much memory up, but will consume hard disk capacity.
    +
    +If there are many output streams being written to in a single process, the
    +amount of memory or disk used is the multiple of all stream's active 
memory/disk use.
    +
    +Careful tuning may be needed to reduce the risk of running out memory, 
especially
    +if the data is buffered in memory.
    +
    +There are a number parameters which can be tuned: 
    +
    +1. The total number of threads available in the filesystem for data
    +uploads *or any other queued filesystem operation*. This is set in
    +`fs.s3a.threads.max`
    +
    +1. The number of operations which can be queued for execution:, *awaiting
    +a thread*: `fs.s3a.max.total.tasks`
    +
    +1. The number of blocks which a single output stream can have active,
    +that is: being uploaded by a thread, or queued in the filesystem thread 
queue:
    +`fs.s3a.fast.upload.active.blocks`
    +
    +1. How long an idle thread can stay in the thread pool before it is 
retired: `fs.s3a.threads.keepalivetime`
    +
    +
    +When the maximum allowed number of active blocks of a single stream is 
reached,
    +no more blocks can be uploaded from that stream until one or more of those 
active
    +blocks' uploads completes. That is: a `write()` call which would trigger 
an upload
    +of a now full datablock, will instead block until there is capacity in the 
queue.
    +
    +How does that come together? 
    +
    +* As the pool of threads set in `fs.s3a.threads.max` is shared (and 
intended
    +to be used across all threads), a larger number here can allow for more
    +parallel operations. However, as uploads require network bandwidth, adding 
more
    +threads does not guarantee speedup.
    +
    +* The extra queue of tasks for the thread pool (`fs.s3a.max.total.tasks`)
    +covers all ongoing background S3A operations (future plans include: 
parallelized
    +rename operations, asynchronous directory operations).
    +
    +* When using memory buffering, a small value of 
`fs.s3a.fast.upload.active.blocks`
    +limits the amount of memory which can be consumed per stream.
    +
    +* When using disk buffering a larger value of 
`fs.s3a.fast.upload.active.blocks`
    +does not consume much memory. But it may result in a large number of 
blocks to
    +compete with other filesystem operations.
    +
    +
    +We recommend a low value of `fs.s3a.fast.upload.active.blocks`; enough
    +to start background upload without overloading other parts of the system,
    +then experiment to see if higher values deliver more throughtput 
—especially
    +from VMs running on EC2.
    +
    +```xml
    +
    +<property>
    +  <name>fs.s3a.fast.upload.active.blocks</name>
    +  <value>4</value>
    +  <description>
    +    Maximum Number of blocks a single output stream can have
    +    active (uploading, or queued to the central FileSystem
    +    instance's pool of queued operations.
    +
    +    This stops a single stream overloading the shared thread pool.
    +  </description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.threads.max</name>
    +  <value>10</value>
    +  <description>The total number of threads available in the filesystem for 
data
    +    uploads *or any other queued filesystem operation*.</description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.max.total.tasks</name>
    +  <value>5</value>
    +  <description>The number of operations which can be queued for 
execution</description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.threads.keepalivetime</name>
    +  <value>60</value>
    +  <description>Number of seconds a thread can be idle before being
    +    terminated.</description>
    +</property>
    +
    +```
    +
    +
    +#### <a name="s3a_multipart_purge"></a>Cleaning up After Incremental 
Upload Failures: `fs.s3a.multipart.purge`
    +
    +
    +If an incremental streaming operation is interrupted, there may be 
    +intermediate partitions uploaded to S3 —data which will be billed for.
    +
    +These charges can be reduced by enabling `fs.s3a.multipart.purge`, 
    +and setting a purge time in seconds, such as 86400 seconds —24 hours, after
    +which the S3 service automatically deletes outstanding multipart
    --- End diff --
    
    Does it? Never knew that. I'd thought it was server side. Will change. 
Also, we could make that an async operation; it's not needed to bring up the FS.


> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
>                 Key: HADOOP-13560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13560
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13560-branch-2-001.patch, 
> HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch, 
> HADOOP-13560-branch-2-004.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights 
> that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rname and verify that the copy really 
> works
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very 
> large commit operations for committers using rename



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to