This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
commit 29a0e8e5e7d69b0450ac3865c25df0e5d75d758b
Author: aloys <[email protected]>
AuthorDate: Fri Jun 28 00:39:56 2019 +0800

    [FLINK-12944][docs-zh] Translate "Streaming File Sink" page into Chinese

    This closes #8918
---
 docs/dev/connectors/streamfile_sink.zh.md | 93 ++++++++++++------------------
 1 file changed, 35 insertions(+), 58 deletions(-)

diff --git a/docs/dev/connectors/streamfile_sink.zh.md b/docs/dev/connectors/streamfile_sink.zh.md
index 40b104b..79586ad 100644
--- a/docs/dev/connectors/streamfile_sink.zh.md
+++ b/docs/dev/connectors/streamfile_sink.zh.md
@@ -23,30 +23,22 @@ specific language governing permissions and limitations
 under the License.
 -->

-This connector provides a Sink that writes partitioned files to filesystems
-supported by the [Flink `FileSystem` abstraction]({{ site.baseurl}}/ops/filesystems/index.html).
+这个连接器提供了一个 Sink 来将分区文件写入到支持 [Flink `FileSystem`]({{ site.baseurl}}/zh/ops/filesystems/index.html) 接口的文件系统中。

-Since in streaming the input is potentially infinite, the streaming file sink writes data
-into buckets. The bucketing behaviour is configurable but a useful default is time-based
-bucketing where we start writing a new bucket every hour and thus get
-individual files that each contain a part of the infinite output stream.
+由于在流处理中输入可能是无限的，所以流处理的文件 sink 会将数据写入到桶中。如何分桶是可以配置的，一种有效的默认
+策略是基于时间的分桶，这种策略每个小时写入一个新的桶，这些桶各包含了无限输出流的一部分数据。

-Within a bucket, we further split the output into smaller part files based on a
-rolling policy. This is useful to prevent individual bucket files from getting
-too big. This is also configurable but the default policy rolls files based on
-file size and a timeout, *i.e* if no new data was written to a part file.
+在一个桶内部，会进一步将输出基于滚动策略切分成更小的文件。这有助于防止桶文件变得过大。滚动策略也是可以配置的，默认
+策略会根据文件大小和超时时间来滚动文件，超时时间是指没有新数据写入部分文件(part file)的时间。

-The `StreamingFileSink` supports both row-wise encoding formats and
-bulk-encoding formats, such as [Apache Parquet](http://parquet.apache.org).
+`StreamingFileSink` 支持行编码格式和批量编码格式，比如 [Apache Parquet](http://parquet.apache.org)。

-#### Using Row-encoded Output Formats
+#### 使用行编码输出格式

-The only required configuration are the base path where we want to output our
-data and an
-[Encoder]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/serialization/Encoder.html)
-that is used for serializing records to the `OutputStream` for each file.
+只需要配置一个输出路径和一个 [Encoder]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/serialization/Encoder.html)。
+Encoder负责为每个文件的 `OutputStream` 序列化数据。

-Basic usage thus looks like this:
+基本用法如下：

 <div class="codetabs" markdown="1">
@@ -84,55 +76,40 @@ input.addSink(sink)
 </div>
 </div>

-This will create a streaming sink that creates hourly buckets and uses a
-default rolling policy. The default bucket assigner is
+上面的代码创建了一个按小时分桶、按默认策略滚动的 sink。默认分桶器是
 [DateTimeBucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html)
-and the default rolling policy is
-[DefaultRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html).
-You can specify a custom
+,默认滚动策略是
+[DefaultRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html)。
+可以为 sink builder 自定义
 [BucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/BucketAssigner.html)
-and
-[RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html)
-on the sink builder. Please check out the JavaDoc for
-[StreamingFileSink]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.html)
-for more configuration options and more documentation about the workings and
-interactions of bucket assigners and rolling policies.
+和
+[RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html)。
+更多配置操作以及分桶器和滚动策略的工作机制和相互影响请参考：
+[StreamingFileSink]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.html)。

-#### Using Bulk-encoded Output Formats
+#### 使用批量编码输出格式

-In the above example we used an `Encoder` that can encode or serialize each
-record individually. The streaming file sink also supports bulk-encoded output
-formats such as [Apache Parquet](http://parquet.apache.org). To use these,
-instead of `StreamingFileSink.forRowFormat()` you would use
-`StreamingFileSink.forBulkFormat()` and specify a `BulkWriter.Factory`.
+上面的示例使用 `Encoder` 分别序列化每一个记录。除此之外，流式文件 sink 还支持批量编码的输出格式，比如 [Apache Parquet](http://parquet.apache.org)。
+使用这种编码格式需要用 `StreamingFileSink.forBulkFormat()` 来代替 `StreamingFileSink.forRowFormat()` ，然后指定一个 `BulkWriter.Factory`。

 [ParquetAvroWriters]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.html)
-has static methods for creating a `BulkWriter.Factory` for various types.
+中包含了为各种类型创建 `BulkWriter.Factory` 的静态方法。

 <div class="alert alert-info">
-  <b>IMPORTANT:</b> Bulk-encoding formats can only be combined with the
-  `OnCheckpointRollingPolicy`, which rolls the in-progress part file on
-  every checkpoint.
+  <b>重要:</b> 批量编码格式只能和 `OnCheckpointRollingPolicy` 结合使用，每次做 checkpoint 时滚动文件。
 </div>

-#### Important Considerations for S3
-
-<span class="label label-danger">Important Note 1</span>: For S3, the `StreamingFileSink`
-supports only the [Hadoop-based](https://hadoop.apache.org/) FileSystem implementation, not
-the implementation based on [Presto](https://prestodb.io/). In case your job uses the
-`StreamingFileSink` to write to S3 but you want to use the Presto-based one for checkpointing,
-it is advised to use explicitly *"s3a://"* (for Hadoop) as the scheme for the target path of
-the sink and *"s3p://"* for checkpointing (for Presto). Using *"s3://"* for both the sink
-and checkpointing may lead to unpredictable behavior, as both implementations "listen" to that scheme.
-
-<span class="label label-danger">Important Note 2</span>: To guarantee exactly-once semantics while
-being efficient, the `StreamingFileSink` uses the [Multi-part Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
-feature of S3 (MPU from now on). This feature allows to upload files in independent chunks (thus the "multi-part")
-which can be combined into the original file when all the parts of the MPU are successfully uploaded.
-For inactive MPUs, S3 supports a bucket lifecycle rule that the user can use to abort multipart uploads
-that don't complete within a specified number of days after being initiated. This implies that if you set this rule
-aggressively and take a savepoint with some part-files being not fully uploaded, their associated MPUs may time-out
-before the job is restarted. This will result in your job not being able to restore from that savepoint as the
-pending part-files are no longer there and Flink will fail with an exception as it tries to fetch them and fails.
+#### 关于S3的重要内容
+
+<span class="label label-danger">重要提示 1</span>: 对于 S3，`StreamingFileSink` 只支持基于 [Hadoop](https://hadoop.apache.org/)
+的文件系统实现，不支持基于 [Presto](https://prestodb.io/) 的实现。如果想使用 `StreamingFileSink` 向 S3 写入数据并且将
+checkpoint 放在基于 Presto 的文件系统，建议明确指定 *"s3a://"* (for Hadoop)作为sink的目标路径方案，并且为 checkpoint 路径明确指定 *"s3p://"* (for Presto)。
+如果 Sink 和 checkpoint 都使用 *"s3://"* 路径的话，可能会导致不可预知的行为，因为双方的实现都在“监听”这个路径。
+
+<span class="label label-danger">重要提示 2</span>: `StreamingFileSink` 使用 S3 的 [Multi-part Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
+(后续使用MPU代替)特性可以保证精确一次的语义。这个特性支持以独立的块(因此被称为"multi-part")模式上传文件，当 MPU 的所有部分文件
+成功上传之后，可以合并成原始文件。对于失效的 MPUs，S3 提供了一个基于桶生命周期的规则，用户可以用这个规则来丢弃在指定时间内未完成的MPU。
+如果在一些部分文件还未上传时触发 savepoint，并且这个规则设置的比较严格，这意味着相关的 MPU在作业重启之前可能会超时。后续的部分文件没
+有写入到 savepoint, 那么在 Flink 作业从 savepoint 恢复时，会因为拿不到缺失的部分文件，导致任务失败并抛出异常。

 {% top %}
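
For reviewers reading the patch without the surrounding page: the row-encoded builder this section documents is used roughly as in the Java sketch below. The actual usage example sits in unchanged hunk context and is not part of this diff; the output path, charset, and rolling thresholds here are illustrative placeholders.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

    DataStream<String> input = ...;  // whatever upstream produces the records

    // Row-encoded sink: the Encoder serializes each record to the part file's OutputStream.
    // Bucketing defaults to one bucket per hour (DateTimeBucketAssigner); a custom assigner
    // could be chained in the same way via withBucketAssigner(...).
    final StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("/base/output-path"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(
            DefaultRollingPolicy.create()
                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))   // roll a part file after it has been open 15 minutes
                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))  // roll after 5 minutes without new data
                .withMaxPartSize(128 * 1024 * 1024)                    // roll once a part file reaches 128 MB
                .build())
        .build();

    input.addSink(sink);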
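
The bulk-encoded variant described in the page differs only in the builder entry point: a `BulkWriter.Factory`, e.g. from `ParquetAvroWriters`, replaces the `Encoder`. A minimal sketch, assuming a hypothetical Avro-reflectable POJO `MyEvent` (not a class from this patch) and the flink-parquet dependency on the classpath:

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    DataStream<MyEvent> input = ...;  // MyEvent is a placeholder POJO

    // Bulk-encoded (Parquet) sink: part files roll only on checkpoints (OnCheckpointRollingPolicy),
    // so checkpointing must be enabled for files to ever be finalized.
    final StreamingFileSink<MyEvent> sink = StreamingFileSink
        .forBulkFormat(new Path("/base/output-path"), ParquetAvroWriters.forReflectRecord(MyEvent.class))
        .build();

    input.addSink(sink);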
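
For the S3 notes, the recommended scheme split between the sink path and the checkpoint path can be sketched as follows. The bucket names are made up, and this assumes both the flink-s3-fs-hadoop and flink-s3-fs-presto filesystems are available to the job; it is one way to express the split, not code from this commit.

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000);

    // Checkpoints go through the Presto-based S3 filesystem ("s3p://")...
    env.setStateBackend(new FsStateBackend("s3p://my-checkpoint-bucket/checkpoints"));

    // ...while the StreamingFileSink writes its part files through the Hadoop-based one ("s3a://").
    final StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("s3a://my-output-bucket/data"), new SimpleStringEncoder<String>("UTF-8"))
        .build();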
