[flink] branch release-1.8 updated: [FLINK-11984][docs] MPU timeout implications on StreamingFileSink.

kkloudas Mon, 25 Mar 2019 09:00:25 -0700

This is an automated email from the ASF dual-hosted git repository.

kkloudas pushed a commit to branch release-1.8
in repository https://gitbox.apache.org/repos/asf/flink.git



The following commit(s) were added to refs/heads/release-1.8 by this push:
     new 839886e  [FLINK-11984][docs] MPU timeout implications on 
StreamingFileSink.
839886e is described below

commit 839886e558c4c4c3d8a5f6639adb257a1bcdfbe3
Author: Kostas Kloudas <[email protected]>
AuthorDate: Wed Mar 20 13:53:07 2019 +0100

    [FLINK-11984][docs] MPU timeout implications on StreamingFileSink.
---
 docs/dev/connectors/streamfile_sink.md | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/docs/dev/connectors/streamfile_sink.md 
b/docs/dev/connectors/streamfile_sink.md
index 82ab562..353a1f8 100644
--- a/docs/dev/connectors/streamfile_sink.md
+++ b/docs/dev/connectors/streamfile_sink.md
@@ -26,14 +26,6 @@ under the License.
 This connector provides a Sink that writes partitioned files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{ 
site.baseurl}}/ops/filesystems.html).
 
-<span class="label label-danger">Important Note</span>: For S3, the 
`StreamingFileSink` 
-supports only the [Hadoop-based](https://hadoop.apache.org/) FileSystem 
implementation, not
-the implementation based on [Presto](https://prestodb.io/). In case your job 
uses the 
-`StreamingFileSink` to write to S3 but you want to use the Presto-based one 
for checkpointing,
-it is advised to use explicitly *"s3a://"* (for Hadoop) as the scheme for the 
target path of
-the sink and *"s3p://"* for checkpointing (for Presto). Using *"s3://"* for 
both the sink
-and checkpointing may lead to unpredictable behavior, as both implementations 
"listen" to that scheme.
-
 Since in streaming the input is potentially infinite, the streaming file sink 
writes data
 into buckets. The bucketing behaviour is configurable but a useful default is 
time-based
 bucketing where we start writing a new bucket every hour and thus get
@@ -123,4 +115,24 @@ has static methods for creating a `BulkWriter.Factory` for 
various types.
     every checkpoint.
 </div>
 
+#### Important Considerations for S3
+
+<span class="label label-danger">Important Note 1</span>: For S3, the 
`StreamingFileSink` 
+supports only the [Hadoop-based](https://hadoop.apache.org/) FileSystem 
implementation, not
+the implementation based on [Presto](https://prestodb.io/). In case your job 
uses the 
+`StreamingFileSink` to write to S3 but you want to use the Presto-based one 
for checkpointing,
+it is advised to use explicitly *"s3a://"* (for Hadoop) as the scheme for the 
target path of
+the sink and *"s3p://"* for checkpointing (for Presto). Using *"s3://"* for 
both the sink
+and checkpointing may lead to unpredictable behavior, as both implementations 
"listen" to that scheme.
+
+<span class="label label-danger">Important Note 2</span>: To guarantee 
exactly-once semantics while
+being efficient, the `StreamingFileSink` uses the [Multi-part 
Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
+feature of S3 (MPU from now on). This feature allows to upload files in 
independent chunks (thus the "multi-part")
+which can be combined into the original file when all the parts of the MPU are 
successfully uploaded. 
+For inactive MPUs, S3 supports a bucket lifecycle rule that the user can use 
to abort multipart uploads 
+that don't complete within a specified number of days after being initiated. 
This implies that if you set this rule 
+aggressively and take a savepoint with some part-files being not fully 
uploaded, their associated MPUs may time-out 
+before the job is restarted. This will result in your job not being able to 
restore from that savepoint as the
+pending part-files are no longer there and Flink will fail with an exception 
as it tries to fetch them and fails.
+
 {% top %}

[flink] branch release-1.8 updated: [FLINK-11984][docs] MPU timeout implications on StreamingFileSink.

Reply via email to