[ 
https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13560:
------------------------------------
    Description: 
There are two output stream mechanisms in Hadoop 2.7.x, neither of which handles 
massive multi-GB files that well.

"classic": buffers everything to HDD until the close() operation; time to close 
becomes O(data), as does the disk space required. It fails to exploit idle 
bandwidth, and on EC2 VMs with little HDD capacity (especially when competing 
with HDFS storage), it can fill up the disk.

{{S3AFastOutputStream}} uploads data in partition-sized blocks, buffering via 
byte arrays. This avoids the disk problems, and as it starts uploading as soon 
as the first partition is ready, close() time is O(outstanding-data). However, 
it needs tuning to limit the amount of data buffered. Get it wrong, and the 
first clue you get may be that the process goes OOM or is killed by YARN. That 
is a shame, because get it right and operations which generate lots of data, 
including distcp, complete much faster. A rough sketch of the tuning involved 
follows.
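
A rough sketch of that tuning, assuming the branch-2 s3a option names; the 
values are purely illustrative, not recommendations:

{code:java}
// Illustrative only: the kind of tuning S3AFastOutputStream needs today.
// Key names follow the s3a options in branch-2; values here are examples.
import org.apache.hadoop.conf.Configuration;

public class FastUploadTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);              // switch to S3AFastOutputStream
    conf.setLong("fs.s3a.multipart.size", 64 * 1024 * 1024);  // partition/block size in bytes
    conf.setInt("fs.s3a.threads.max", 4);                     // concurrent part uploads
    conf.setInt("fs.s3a.max.total.tasks", 4);                 // uploads queued awaiting a thread
    // Worst-case heap consumed by buffering is roughly
    // multipart.size * (threads.max + max.total.tasks); get this wrong on a small
    // JVM and the process can OOM or be killed by YARN for exceeding its allocation.
    return conf;
  }
}
{code}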

This patch proposes a new output stream, {{S3ABlockOutputStream}}, a successor 
to both. It:

# Uses the block upload model of {{S3AFastOutputStream}}.
# Supports buffering via HDD, heap arrays and (recycled) byte buffers, offering 
a choice between memory and HDD use. With HDD buffering there are no OOM 
problems on small JVMs and no need to tune.
# Uses the fast output stream's mechanism of limiting the queue size of data 
awaiting upload. Even when buffering via HDD, that usage may need to be limited.
# Adds lots of instrumentation to see what's being written.
# Has good defaults out of the box (e.g. buffer to HDD, a partition size 
striking a good balance between early upload and scalability).
# Is robust against transient failures. The AWS SDK retries a PUT on failure; 
the entire block may need to be replayed, so HDD input cannot be buffered via 
{{java.io.BufferedInputStream}}. It has also surfaced in testing that if the 
final commit of a multipart upload fails, it isn't retried, at least in the 
current SDK in use; we do that retry ourselves.
# Uses round-robin directory allocation for the most effective disk use.
# Takes an AWS SDK {{com.amazonaws.event.ProgressListener}} for progress 
callbacks, giving more detail on the operation. (It actually takes an 
{{org.apache.hadoop.util.Progressable}}, but if that also implements the AWS 
interface, that is used instead; see the sketch after this list.)
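
As a sketch of that callback contract, the following class (names here are 
illustrative, not part of the patch) implements both interfaces, so the 
Hadoop-side heartbeat and the SDK-side detail land in one object:

{code:java}
// Sketch only: a progress callback implementing both the Hadoop and AWS interfaces.
// If a caller passes this as the Progressable, the stream can also register it as
// the SDK ProgressListener and deliver the richer per-event callbacks to it.
import com.amazonaws.event.ProgressEvent;
import com.amazonaws.event.ProgressListener;
import org.apache.hadoop.util.Progressable;

public class UploadProgress implements Progressable, ProgressListener {
  private long bytesTransferred;

  @Override
  public void progress() {
    // Hadoop-side heartbeat: enough to stop the task being declared dead
    // during a long upload.
  }

  @Override
  public void progressChanged(ProgressEvent progressEvent) {
    // AWS-side detail: per-event byte counts, request and part completion events, etc.
    bytesTransferred += progressEvent.getBytesTransferred();
  }

  public long getBytesTransferred() {
    return bytesTransferred;
  }
}
{code}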

All of this is to come with scale tests:

* Generate large files using all buffer mechanisms.
* Do a large copy/rename and verify that the copy really works, including 
metadata.
* Make sizes configurable up to multi-GB, which also means that the test 
timeouts need to be configurable to match the time the tests can take.
* As the tests are slow, make them optional, using the {{-Dscale}} switch to 
enable them; a gating sketch follows this list.
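
A minimal sketch of how such gating and a size-matched timeout could look in a 
JUnit 4 test; the property names used here ({{scale.test.enabled}}, 
{{scale.test.timeout}}) are illustrative placeholders, not the keys used in the 
actual suite:

{code:java}
// Sketch only: gate a slow test behind a system property and make its timeout
// configurable rather than hard-coded. Property names are placeholders.
import java.util.concurrent.TimeUnit;

import org.junit.Assume;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

public class HugeFileScaleTestSketch {

  // Timeout comes from configuration so multi-GB runs are not killed early.
  @Rule
  public Timeout testTimeout =
      new Timeout(Integer.getInteger("scale.test.timeout", 30 * 60), TimeUnit.SECONDS);

  @Test
  public void testWriteHugeFile() throws Exception {
    // Skip unless the scale profile is enabled, e.g. via -Dscale on the maven
    // command line wiring this property through.
    Assume.assumeTrue("scale tests disabled",
        Boolean.getBoolean("scale.test.enabled"));
    // ... write a multi-GB file with the chosen buffer mechanism, then rename and verify ...
  }
}
{code}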

Verifying large file rename is important on its own, as it is needed for very 
large commit operations by committers which use rename.

The goal here is to implement a single object stream which can be used for all 
outputs, tuneable as to whether to use disk or memory and as to queue sizes, 
but which is otherwise all that's needed. We can do future development on this, 
remove its predecessor {{S3AFastOutputStream}} so as to keep docs and testing 
down, and leave the original {{S3AOutputStream}} alone for regression 
testing/fallback.
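
As a sketch of what that tuneability could look like from the user side (the 
option names shown are illustrative of the proposal, not final):

{code:java}
// Sketch only: selecting the buffer mechanism and bounding queue sizes for the
// proposed block output stream. Option names illustrate the proposal and may change.
import org.apache.hadoop.conf.Configuration;

public class BlockOutputTuningSketch {
  public static Configuration blockUploadConf() {
    Configuration conf = new Configuration();
    // Buffer mechanism: local disk (the safe default), heap byte arrays,
    // or recycled byte buffers.
    conf.set("fs.s3a.fast.upload.buffer", "disk");      // or "array" / "bytebuffer"
    // Bound the number of blocks a single stream may have buffered, queued or
    // uploading, so even disk buffering cannot grow without limit.
    conf.setInt("fs.s3a.fast.upload.active.blocks", 4);
    // Block/partition size: small enough to start uploading early, large enough
    // to keep the part count of a huge file manageable.
    conf.setLong("fs.s3a.multipart.size", 64 * 1024 * 1024);
    return conf;
  }
}
{code}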

  was:
An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights 
that metadata isn't copied on large copies.

1. Add a test to do that large copy/rename and verify that the copy really works.
2. Verify that metadata makes it over.

Verifying large file rename is important on its own, as it is needed for very 
large commit operations by committers which use rename.


> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
>                 Key: HADOOP-13560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13560
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>             Fix For: 2.8.0, 3.0.0-alpha2
>
>         Attachments: HADOOP-13560-branch-2-001.patch, 
> HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch, 
> HADOOP-13560-branch-2-004.patch
>
>


