kl0u commented on a change in pull request #9530: [FLINK-13842][docs] Improve
Documentation of the StreamingFileSink
URL: https://github.com/apache/flink/pull/9530#discussion_r321239484
##########
File path: docs/dev/connectors/streamfile_sink.md
##########
@@ -23,30 +23,140 @@ specific language governing permissions and limitations
under the License.
-->
+* This will be replaced by the TOC
+{:toc}
+
This connector provides a Sink that writes partitioned files to filesystems
supported by the [Flink `FileSystem` abstraction]({{
site.baseurl}}/ops/filesystems/index.html).
-Since in streaming the input is potentially infinite, the streaming file sink
writes data
-into buckets. The bucketing behaviour is configurable but a useful default is
time-based
-bucketing where we start writing a new bucket every hour and thus get
-individual files that each contain a part of the infinite output stream.
+In order to handle unbounded data streams, the streaming file sink writes
incoming data
+into buckets. The bucketing behaviour is fully configurable with a default
time-based
+bucketing where we start writing a new bucket every hour and thus get files
that correspond to
+records received during certain time intervals from the stream.
+
+The bucket directories themselves contain several part files with the actual
output data, with at least
+one for each subtask of the sink that has received data for the bucket.
Additional part files will be created according to the configurable
+rolling policy. The default policy rolls files based on size, a timeout that
specifies the maximum duration for which a file can be open, and a maximum
inactivity timeout after which the file is closed.
+
+ <div class="alert alert-info">
+ <b>IMPORTANT:</b> Checkpointing needs to be enabled when using the
StreamingFileSink. Part files can only be finalized
+ on successful checkpoints. If checkpointing is disabled part files will
forever stay in `in-progress` or `pending` state
+ and cannot be safely read by downstream systems.
+ </div>
+
+ <img src="{{ site.baseurl }}/fig/streamfilesink_bucketing.png" class="center"
style="width: 100%;" />
+
+### Bucket Assignment
+
+The bucketing logic defines how the data will be structured into
subdirectories inside the base output directory.
+
+Both row and bulk formats use the [DateTimeBucketAssigner]({{
site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html)
as the default assigner.
+By default the DateTimeBucketAssigner creates hourly buckets based on the
system default timezone
+with the following format: `yyyy-MM-dd--HH`. Both the date format (i.e. bucket
size) and timezone can be
+configured manually.
+
+We can specify a custom [BucketAssigner]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/BucketAssigner.html)
by calling `.withBucketAssigner(assigner)` on the format builders.
+
+Flink comes with two built in BucketAssigners:
+
+ - [DateTimeBucketAssigner]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html)
: Default time based assigner
+ - [BasePathBucketAssigner]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/BasePathBucketAssigner.html)
: Assigner that stores all part files in the base path (single global bucket)
+
+### Rolling Policy
+
+The [RollingPolicy]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html)
defines when a given in-progress part file will be closed and moved to the
pending and later to finished state.
+In combination with the checkpointing interval (pending files become finished
on the next checkpoint) this controls how quickly
+part files become available for downstream readers and also the size and
number of these parts.
+
+Flink comes with two built-in RollingPolicies:
+
+ - [DefaultRollingPolicy]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html)
+ - [OnCheckpointRollingPolicy]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/OnCheckpointRollingPolicy.html)
+
+### Part file lifecycle
+
+In order to use the output of the StreamingFileSink in downstream systems, we
need to understand the naming and lifecycle of the output files produced.
+
+Part files can be in one of three states:
+ 1. **In-progress** : The part file that is currently being written to is
in-progress
+ 2. **Pending** : Once a part file is closed for writing it becomes pending
+ 3. **Finished** : On successful checkpoints pending files become finished
+
+Only finished files are safe to read by downstream systems as those are
guaranteed to not be modified later. Finished files can be distinguished by
their naming scheme only.
+
+File naming schemes:
+ - **In-progress / Pending**: `part-subtaskIndex-partFileIndex.inprogress.uid`
+ - **Finished:** `part-subtaskIndex-partFileIndex`
+
+Part file indexes are strictly increasing for any given subtask (in the order
they were created). However these indexes are not always sequential. When the
job restarts, the next part index for all subtask will be the max part index +
1.
+
Review comment:
"max part index + 1" -> "`max part index + 1`" just to highlight it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services