kl0u commented on a change in pull request #9530: [FLINK-13842][docs] Improve
Documentation of the StreamingFileSink
URL: https://github.com/apache/flink/pull/9530#discussion_r319895944
##########
File path: docs/dev/connectors/streamfile_sink.md
##########
@@ -23,30 +23,83 @@ specific language governing permissions and limitations
under the License.
-->
+* This will be replaced by the TOC
+{:toc}
+
This connector provides a Sink that writes partitioned files to filesystems
supported by the [Flink `FileSystem` abstraction]({{ site.baseurl }}/ops/filesystems/index.html).
-Since in streaming the input is potentially infinite, the streaming file sink writes data
-into buckets. The bucketing behaviour is configurable but a useful default is time-based
-bucketing where we start writing a new bucket every hour and thus get
-individual files that each contain a part of the infinite output stream.
+In order to handle unbounded data streams, the streaming file sink writes incoming
+data into buckets. The bucketing behaviour is fully configurable, with a time-based
+default that starts a new bucket every hour, so that each bucket contains the
+records received during a certain time interval of the stream.
+
+The bucket directories themselves contain several part files with the actual output
+data, with at least one for each parallel subtask of the sink. Additional part files
+will be created according to the configurable rolling policy. The default policy
+rolls files based on size and a timeout.
+
+Part files can be in one of three states:
+ 1. **In-progress** : The part file is currently open and being written to
+ 2. **Pending** : A closed in-progress file that is waiting to be committed
+ 3. **Finished** : On a successful checkpoint, pending files are committed and become finished
+
+Each sink subtask has at most one in-progress part file per active bucket at any
+given time, but there can be several pending and finished files.
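
The lifecycle above can be sketched as a tiny state machine. The enum and transition methods below are illustrative stand-ins for behaviour the sink manages internally; they are not Flink API:

```java
// Illustrative sketch of the part-file lifecycle described above.
public class PartFileLifecycle {
    enum State { IN_PROGRESS, PENDING, FINISHED }

    // Rolling closes an in-progress file, making it pending.
    static State onRoll(State s) {
        return s == State.IN_PROGRESS ? State.PENDING : s;
    }

    // A successful checkpoint commits all pending files as finished.
    static State onCheckpointComplete(State s) {
        return s == State.PENDING ? State.FINISHED : s;
    }

    public static void main(String[] args) {
        State s = State.IN_PROGRESS;
        s = onRoll(s);                 // IN_PROGRESS -> PENDING
        s = onCheckpointComplete(s);   // PENDING -> FINISHED
        System.out.println(s);         // prints FINISHED
    }
}
```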
+
+<img src="{{ site.baseurl }}/fig/streamfilesink_bucketing.png" class="center" style="width: 100%;" />
+
+### Bucket Assignment
+
+The bucketing logic defines how the data will be structured into subdirectories inside the base output directory.
+
+Both row and bulk formats use the [DateTimeBucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html) as the default assigner.
+By default the DateTimeBucketAssigner creates hourly buckets based on the system
+default timezone, with the following format: `yyyy-MM-dd--HH`. Both the date format
+(i.e. the bucket granularity) and the timezone can be configured manually.
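
The default bucket id derivation can be sketched with plain `java.time`. This mirrors the documented `yyyy-MM-dd--HH` format; the class and method names here are illustrative, not Flink's actual implementation:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Sketch of the default time-based bucket id: a record's timestamp is
// formatted as "yyyy-MM-dd--HH" in a configurable timezone, so all records
// within the same hour share one bucket directory.
public class BucketIdSketch {
    static String bucketId(long epochMillis, ZoneId zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(zone);
        return fmt.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 2019-08-30T10:15:30Z falls into the 10 o'clock UTC bucket.
        long ts = Instant.parse("2019-08-30T10:15:30Z").toEpochMilli();
        System.out.println(bucketId(ts, ZoneId.of("UTC"))); // prints 2019-08-30--10
    }
}
```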
+
+We can specify a custom [BucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/BucketAssigner.html) by calling `.withBucketAssigner(assigner)` on the format builders.
+
+Flink comes with two built-in BucketAssigners:
+
+ - [DateTimeBucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html) : Default time-based assigner
+ - [BasePathBucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/BasePathBucketAssigner.html) : Assigner that stores all part files in the base path (single global bucket)
+
+### Rolling Policy
-Within a bucket, we further split the output into smaller part files based on a
-rolling policy. This is useful to prevent individual bucket files from getting
-too big. This is also configurable but the default policy rolls files based on
-file size and a timeout, *i.e* if no new data was written to a part file.
+The [RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html) defines when a given in-progress part file is closed and moved to the pending and, later, to the finished state.
+In combination with the checkpointing interval (pending files become finished on the next successful checkpoint), this controls how quickly part files become available for downstream readers, as well as the size and number of these parts.
-The `StreamingFileSink` supports both row-wise encoding formats and
-bulk-encoding formats, such as [Apache Parquet](http://parquet.apache.org).
+Flink comes with two built-in RollingPolicies:
-#### Using Row-encoded Output Formats
+- [DefaultRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html)
+- [OnCheckpointRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/OnCheckpointRollingPolicy.html)
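
A size/time based rolling decision of the kind described above (roll on size and on a timeout) can be sketched as follows. The thresholds, field names, and method signature are hypothetical illustrations, not Flink's DefaultRollingPolicy API:

```java
// Simplified sketch of the checks a size/time based rolling policy performs.
public class RollingPolicySketch {
    final long maxPartSizeBytes;      // roll when the part file grows past this size
    final long rolloverIntervalMs;    // roll when the file has been open this long
    final long inactivityIntervalMs;  // roll when no records arrived for this long

    RollingPolicySketch(long maxPartSizeBytes, long rolloverIntervalMs, long inactivityIntervalMs) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverIntervalMs = rolloverIntervalMs;
        this.inactivityIntervalMs = inactivityIntervalMs;
    }

    boolean shouldRoll(long currentSizeBytes, long creationTimeMs, long lastWriteTimeMs, long nowMs) {
        return currentSizeBytes >= maxPartSizeBytes
                || nowMs - creationTimeMs >= rolloverIntervalMs
                || nowMs - lastWriteTimeMs >= inactivityIntervalMs;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: 128 MB max size, 60 s rollover, 60 s inactivity.
        RollingPolicySketch p = new RollingPolicySketch(128L << 20, 60_000, 60_000);
        System.out.println(p.shouldRoll(1024, 0, 0, 5_000));       // false: small and fresh
        System.out.println(p.shouldRoll(256L << 20, 0, 0, 5_000)); // true: over the size limit
        System.out.println(p.shouldRoll(1024, 0, 0, 61_000));      // true: inactivity timeout
    }
}
```

OnCheckpointRollingPolicy, by contrast, rolls the in-progress file on every checkpoint rather than on size or time.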
-The only required configuration are the base path where we want to output our
-data and an [Encoder]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/serialization/Encoder.html)
-that is used for serializing records to the `OutputStream` for each file.
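
What a row-wise encoder does can be sketched with the standard library alone: serialize each record to the part file's `OutputStream`, one row at a time. The interface below only mirrors the shape of Flink's Encoder and is a local stand-in, not its API; the newline-delimited string encoder is analogous to Flink's SimpleStringEncoder:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: a row-wise encoder appends each record to the
// part file's stream as it arrives.
public class EncoderSketch {
    interface Encoder<IN> {
        void encode(IN element, OutputStream stream) throws IOException;
    }

    // Newline-delimited string encoding (analogous to SimpleStringEncoder).
    static final Encoder<String> STRING_ENCODER = (element, stream) ->
            stream.write((element + "\n").getBytes(StandardCharsets.UTF_8));

    public static void main(String[] args) throws IOException {
        // Stand-in for an in-progress part file's stream.
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        STRING_ENCODER.encode("event-1", partFile);
        STRING_ENCODER.encode("event-2", partFile);
        System.out.print(partFile.toString("UTF-8")); // prints the two rows, one per line
    }
}
```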
+Time based rolling policies are checked at pre-defined intervals that can be configured on the format builders.
Review comment:
"Time based rolling policies are checked at pre-defined intervals that can
be configured on the format builders." -> This seems a bit cryptic. I think it
could be simply removed, if you agree.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services