This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
commit 29a0e8e5e7d69b0450ac3865c25df0e5d75d758b
Author: aloys <[email protected]>
AuthorDate: Fri Jun 28 00:39:56 2019 +0800

    [FLINK-12944][docs-zh] Translate "Streaming File Sink" page into Chinese

    This closes #8918
---
 docs/dev/connectors/streamfile_sink.zh.md | 93 ++++++++++++------------------
 1 file changed, 35 insertions(+), 58 deletions(-)

diff --git a/docs/dev/connectors/streamfile_sink.zh.md b/docs/dev/connectors/streamfile_sink.zh.md
index 40b104b..79586ad 100644
--- a/docs/dev/connectors/streamfile_sink.zh.md
+++ b/docs/dev/connectors/streamfile_sink.zh.md
@@ -23,30 +23,22 @@ specific language governing permissions and limitations
 under the License.
 -->

-This connector provides a Sink that writes partitioned files to filesystems
-supported by the [Flink `FileSystem` abstraction]({{ site.baseurl}}/ops/filesystems/index.html).
+这个连接器提供了一个 Sink 来将分区文件写入到支持 [Flink `FileSystem`]({{ site.baseurl}}/zh/ops/filesystems/index.html) 接口的文件系统中。

-Since in streaming the input is potentially infinite, the streaming file sink writes data
-into buckets. The bucketing behaviour is configurable but a useful default is time-based
-bucketing where we start writing a new bucket every hour and thus get
-individual files that each contain a part of the infinite output stream.
+由于在流处理中输入可能是无限的，所以流处理的文件 sink 会将数据写入到桶中。如何分桶是可以配置的，一种有效的默认
+策略是基于时间的分桶，这种策略每个小时写入一个新的桶，这些桶各包含了无限输出流的一部分数据。

-Within a bucket, we further split the output into smaller part files based on a
-rolling policy. This is useful to prevent individual bucket files from getting
-too big. This is also configurable but the default policy rolls files based on
-file size and a timeout, *i.e* if no new data was written to a part file.
+在一个桶内部，会进一步将输出基于滚动策略切分成更小的文件。这有助于防止桶文件变得过大。滚动策略也是可以配置的，默认
+策略会根据文件大小和超时时间来滚动文件，超时时间是指没有新数据写入部分文件(part file)的时间。

-The `StreamingFileSink` supports both row-wise encoding formats and
-bulk-encoding formats, such as [Apache Parquet](http://parquet.apache.org).
+`StreamingFileSink` 支持行编码格式和批量编码格式，比如 [Apache Parquet](http://parquet.apache.org)。

-#### Using Row-encoded Output Formats
+#### 使用行编码输出格式

-The only required configuration are the base path where we want to output our
-data and an
-[Encoder]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/serialization/Encoder.html)
-that is used for serializing records to the `OutputStream` for each file.
+只需要配置一个输出路径和一个 [Encoder]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/serialization/Encoder.html)。
+Encoder负责为每个文件的 `OutputStream` 序列化数据。

-Basic usage thus looks like this:
+基本用法如下：

 <div class="codetabs" markdown="1">
@@ -84,55 +76,40 @@ input.addSink(sink)
 </div>
 </div>

-This will create a streaming sink that creates hourly buckets and uses a
-default rolling policy. The default bucket assigner is
+上面的代码创建了一个按小时分桶、按默认策略滚动的 sink。默认分桶器是
 [DateTimeBucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.html)
-and the default rolling policy is
-[DefaultRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html).
-You can specify a custom
+,默认滚动策略是
+[DefaultRollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/DefaultRollingPolicy.html)。
+可以为 sink builder 自定义
 [BucketAssigner]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/BucketAssigner.html)
-and
-[RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html)
-on the sink builder. Please check out the JavaDoc for
-[StreamingFileSink]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.html)
-for more configuration options and more documentation about the workings and
-interactions of bucket assigners and rolling policies.
+和
+[RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html)。
+更多配置操作以及分桶器和滚动策略的工作机制和相互影响请参考：
+[StreamingFileSink]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.html)。

-#### Using Bulk-encoded Output Formats
+#### 使用批量编码输出格式

-In the above example we used an `Encoder` that can encode or serialize each
-record individually. The streaming file sink also supports bulk-encoded output
-formats such as [Apache Parquet](http://parquet.apache.org). To use these,
-instead of `StreamingFileSink.forRowFormat()` you would use
-`StreamingFileSink.forBulkFormat()` and specify a `BulkWriter.Factory`.
+上面的示例使用 `Encoder` 分别序列化每一个记录。除此之外，流式文件 sink 还支持批量编码的输出格式，比如 [Apache Parquet](http://parquet.apache.org)。
+使用这种编码格式需要用 `StreamingFileSink.forBulkFormat()` 来代替 `StreamingFileSink.forRowFormat()` ，然后指定一个 `BulkWriter.Factory`。

 [ParquetAvroWriters]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.html)
-has static methods for creating a `BulkWriter.Factory` for various types.
+中包含了为各种类型创建 `BulkWriter.Factory` 的静态方法。

 <div class="alert alert-info">
-  <b>IMPORTANT:</b> Bulk-encoding formats can only be combined with the
-  `OnCheckpointRollingPolicy`, which rolls the in-progress part file on
-  every checkpoint.
+  <b>重要:</b> 批量编码格式只能和 `OnCheckpointRollingPolicy` 结合使用，每次做 checkpoint 时滚动文件。
 </div>

-#### Important Considerations for S3
-
-<span class="label label-danger">Important Note 1</span>: For S3, the `StreamingFileSink`
-supports only the [Hadoop-based](https://hadoop.apache.org/) FileSystem implementation, not
-the implementation based on [Presto](https://prestodb.io/). In case your job uses the
-`StreamingFileSink` to write to S3 but you want to use the Presto-based one for checkpointing,
-it is advised to use explicitly *"s3a://"* (for Hadoop) as the scheme for the target path of
-the sink and *"s3p://"* for checkpointing (for Presto). Using *"s3://"* for both the sink
-and checkpointing may lead to unpredictable behavior, as both implementations "listen" to that scheme.
-
-<span class="label label-danger">Important Note 2</span>: To guarantee exactly-once semantics while
-being efficient, the `StreamingFileSink` uses the [Multi-part Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
-feature of S3 (MPU from now on). This feature allows to upload files in independent chunks (thus the "multi-part")
-which can be combined into the original file when all the parts of the MPU are successfully uploaded.
-For inactive MPUs, S3 supports a bucket lifecycle rule that the user can use to abort multipart uploads
-that don't complete within a specified number of days after being initiated. This implies that if you set this rule
-aggressively and take a savepoint with some part-files being not fully uploaded, their associated MPUs may time-out
-before the job is restarted. This will result in your job not being able to restore from that savepoint as the
-pending part-files are no longer there and Flink will fail with an exception as it tries to fetch them and fails.
+#### 关于S3的重要内容
+
+<span class="label label-danger">重要提示 1</span>: 对于 S3，`StreamingFileSink` 只支持基于 [Hadoop](https://hadoop.apache.org/)
+的文件系统实现，不支持基于 [Presto](https://prestodb.io/) 的实现。如果想使用 `StreamingFileSink` 向 S3 写入数据并且将
+checkpoint 放在基于 Presto 的文件系统，建议明确指定 *"s3a://"* (for Hadoop)作为sink的目标路径方案，并且为 checkpoint 路径明确指定 *"s3p://"* (for Presto)。
+如果 Sink 和 checkpoint 都使用 *"s3://"* 路径的话，可能会导致不可预知的行为，因为双方的实现都在“监听”这个路径。
+
+<span class="label label-danger">重要提示 2</span>: `StreamingFileSink` 使用 S3 的 [Multi-part Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
+(后续使用MPU代替)特性可以保证精确一次的语义。这个特性支持以独立的块(因此被称为"multi-part")模式上传文件，当 MPU 的所有部分文件
+成功上传之后，可以合并成原始文件。对于失效的 MPUs，S3 提供了一个基于桶生命周期的规则，用户可以用这个规则来丢弃在指定时间内未完成的MPU。
+如果在一些部分文件还未上传时触发 savepoint，并且这个规则设置的比较严格，这意味着相关的 MPU在作业重启之前可能会超时。后续的部分文件没
+有写入到 savepoint, 那么在 Flink 作业从 savepoint 恢复时，会因为拿不到缺失的部分文件，导致任务失败并抛出异常。

 {% top %}
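
For reviewers reading the patch without the surrounding page: the row-encoded builder this section documents is used roughly as in the Java sketch below. The actual usage example sits in unchanged hunk context and is not part of this diff; the output path, charset, and rolling thresholds here are illustrative placeholders.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

    DataStream<String> input = ...;  // whatever upstream produces the records

    // Row-encoded sink: the Encoder serializes each record to the part file's OutputStream.
    // Bucketing defaults to one bucket per hour (DateTimeBucketAssigner); a custom assigner
    // could be chained in the same way via withBucketAssigner(...).
    final StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("/base/output-path"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(
            DefaultRollingPolicy.create()
                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))   // roll a part file after it has been open 15 minutes
                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))  // roll after 5 minutes without new data
                .withMaxPartSize(128 * 1024 * 1024)                    // roll once a part file reaches 128 MB
                .build())
        .build();

    input.addSink(sink);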
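
The bulk-encoded variant described in the page differs only in the builder entry point: a `BulkWriter.Factory`, e.g. from `ParquetAvroWriters`, replaces the `Encoder`. A minimal sketch, assuming a hypothetical Avro-reflectable POJO `MyEvent` (not a class from this patch) and the flink-parquet dependency on the classpath:

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    DataStream<MyEvent> input = ...;  // MyEvent is a placeholder POJO

    // Bulk-encoded (Parquet) sink: part files roll only on checkpoints (OnCheckpointRollingPolicy),
    // so checkpointing must be enabled for files to ever be finalized.
    final StreamingFileSink<MyEvent> sink = StreamingFileSink
        .forBulkFormat(new Path("/base/output-path"), ParquetAvroWriters.forReflectRecord(MyEvent.class))
        .build();

    input.addSink(sink);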
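
For the S3 notes, the recommended scheme split between the sink path and the checkpoint path can be sketched as follows. The bucket names are made up, and this assumes both the flink-s3-fs-hadoop and flink-s3-fs-presto filesystems are available to the job; it is one way to express the split, not code from this commit.

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000);

    // Checkpoints go through the Presto-based S3 filesystem ("s3p://")...
    env.setStateBackend(new FsStateBackend("s3p://my-checkpoint-bucket/checkpoints"));

    // ...while the StreamingFileSink writes its part files through the Hadoop-based one ("s3a://").
    final StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("s3a://my-output-bucket/data"), new SimpleStringEncoder<String>("UTF-8"))
        .build();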
