RocMarshal commented on a change in pull request #18718:
URL: https://github.com/apache/flink/pull/18718#discussion_r808681911
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -377,22 +378,23 @@ Flink comes with four built-in BulkWriter factories:
* OrcBulkWriterFactory
{{< hint info >}}
-**Important** Bulk Formats can only have a rolling policy that extends the `CheckpointRollingPolicy`.
-The latter rolls on every checkpoint. A policy can roll additionally based on size or processing time.
+**重要** Bulk-encoded Format 仅支持一种继承了 `CheckpointRollingPolicy` 类的滚动策略。
+在每个 Checkpoint 都会滚动。另外也可以根据大小或处理时间进行滚动。
{{< /hint >}}
-##### Parquet format
+<a name="parquet-format"></a>
+
+##### Parquet Format
-Flink contains built in convenience methods for creating Parquet writer factories for Avro data. These methods
-and their associated documentation can be found in the AvroParquetWriters class.
+Flink 包含了为 Avro Format 数据创建 Parquet 写入工厂的内置便利方法。在 AvroParquetWriters 类中可以发现那些方法以及相关的使用说明。
Review comment:
nit:
Flink 包含了为 Avro Format 数据创建 Parquet 写入工厂的内置便利方法 -> Flink 内置了为 Avro Format 数据创建 Parquet 写入工厂的快捷方法
a minor comment.
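For context on the passage above, a minimal Java sketch of building a Parquet bulk-encoded sink from the `AvroParquetWriters` convenience methods; the wrapper class, output path and record type are illustrative assumptions, not taken from the PR:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;

public class ParquetSinkSketch {
    // Attaches a bulk-encoded Parquet sink for Avro GenericRecords.
    // As a bulk format, it rolls part files on every checkpoint.
    static void addParquetSink(DataStream<GenericRecord> stream, Schema schema) {
        FileSink<GenericRecord> sink = FileSink
                .forBulkFormat(new Path("/tmp/parquet-out"),
                        AvroParquetWriters.forGenericRecord(schema))
                .build();
        stream.sinkTo(sink);
    }
}
```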
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -232,71 +232,72 @@ new HiveSource<>(
{{< /tab >}}
{{< /tabs >}}
-### Current Limitations
+<a name="current-limitations"></a>
+
+### 当前限制
+
+对于大量积压的文件,Watermark 效果不佳。这是因为 Watermark 急于在一个文件中推进,而下一个文件可能包含比 Watermark 更晚的数据。
-Watermarking does not work very well for large backlogs of files. This is because watermarks eagerly advance within a file, and the next file might contain data later than the watermark.
+对于无界 File Sources,枚举器会会将当前所有已处理文件的路径记录到 state 中,在某些情况下,这可能会导致状态变得相当大。
+未来计划将引入一种压缩的方式来跟踪已经处理的文件(例如,将修改时间戳保持在边界以下)。
-For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can, in some cases, grow rather large.
-There are plans to add a compressed form of tracking already processed files in the future (for example, by keeping modification timestamps below boundaries).
+<a name="behind-the-scenes"></a>
-### Behind the Scenes
+### 后记
{{< hint info >}}
-If you are interested in how File Source works through the new data source API design, you may
-want to read this part as a reference. For details about the new data source API, check out the
-[documentation on data sources]({{< ref "docs/dev/datastream/sources.md" >}}) and
+如果你对新设计的数据源 API 中的 File Sources 是如何工作的感兴趣,可以阅读本部分作为参考。关于新的数据源 API 的更多细节,请参考
+[documentation on data sources]({{< ref "docs/dev/datastream/sources.md" >}}) 和在
<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
-for more descriptive discussions.
+中获取更加具体的讨论详情。
{{< /hint >}}
+<a name="file-sink"></a>
+
## File Sink
-The file sink writes incoming data into buckets. Given that the incoming streams can be unbounded,
-data in each bucket is organized into part files of finite size. The bucketing behaviour is fully configurable
-with a default time-based bucketing where we start writing a new bucket every hour. This means that each resulting
-bucket will contain files with records received during 1 hour intervals from the stream.
+File Sink 将传入的数据写入存储桶中。考虑到输入流可以是无界的,每个桶中的数据被组织成有限大小的 Part 文件。
+完全可以配置为基于时间的方式往桶中写入数据,比如我们可以设置每个小时的数据写入一个新桶中。这意味着桶中将包含一个小时间隔内接收到的记录。
-Data within the bucket directories is split into part files. Each bucket will contain at least one part file for
-each subtask of the sink that has received data for that bucket. Additional part files will be created according to the configurable
-rolling policy. For `Row-encoded Formats` (see [File Formats](#file-formats)) the default policy rolls part files based
-on size, a timeout that specifies the maximum duration for which a file can be open, and a maximum inactivity
-timeout after which the file is closed. For `Bulk-encoded Formats` we roll on every checkpoint and the user can
-specify additional conditions based on size or time.
+桶目录中的数据被拆分成多个 Part 文件。对于相应的接收数据的桶的 Sink 的每个 Subtask ,每个桶将至少包含一个 Part 文件。将根据配置的滚动策略来创建其他 Part 文件。
+对于 `Row-encoded Formats`(参考 [Format Types](#sink-format-types))默认的策略是根据 Part 文件大小进行滚动,需要指定文件打开状态最长时间的超时以及文件关闭后的非活动状态的超时时间。
+对于 `Bulk-encoded Formats` 我们在每次创建 Checkpoint 时进行滚动,并且用户也可以添加基于大小或者时间等的其他条件。
{{< hint info >}}
-**IMPORTANT**: Checkpointing needs to be enabled when using the `FileSink` in `STREAMING` mode. Part files
-can only be finalized on successful checkpoints. If checkpointing is disabled, part files will forever stay
-in the `in-progress` or the `pending` state, and cannot be safely read by downstream systems.
+**重要**: 在 `STREAMING` 模式下使用 `FileSink` 需要开启 Checkpoint 功能。
+文件只在 Checkpoint 成功时生成。如果没有开启 Checkpoint 功能,文件将永远停留在 `in-progress` 或者 `pending` 的状态,并且下游系统将不能安全读取该文件数据。
{{< /hint >}}
{{< img src="/fig/streamfilesink_bucketing.png" width="100%" >}}
+<a name="sink-format-types"></a>
+
### Format Types
-The `FileSink` supports both row-wise and bulk encoding formats, such as [Apache Parquet](http://parquet.apache.org).
-These two variants come with their respective builders that can be created with the following static methods:
+`FileSink` 不仅支持 Row-encoded 也支持 Bulk-encoded,例如 [Apache Parquet](http://parquet.apache.org)。
+这两种格式可以通过如下的静态方法进行构造:
- Row-encoded sink: `FileSink.forRowFormat(basePath, rowEncoder)`
- Bulk-encoded sink: `FileSink.forBulkFormat(basePath, bulkWriterFactory)`
-When creating either a row or a bulk encoded sink we have to specify the base path where the buckets will be
-stored and the encoding logic for our data.
+不论创建 Row-encoded Format 或者 Bulk-encoded Format的 Sink 时,我们都必须指定桶的路径以及对数据进行编码的逻辑。
Review comment:
```suggestion
不论创建 Row-encoded Format 或者 Bulk-encoded Format 的 Sink 时,我们都必须指定桶的路径以及对数据进行编码的逻辑。
```
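As a reference for the two builder entry points discussed in this hunk, a minimal Java sketch of a row-encoded sink; the paths, intervals and wrapper class are illustrative assumptions, and checkpointing is enabled because `STREAMING` mode needs it to finalize part files:

```java
import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RowEncodedSinkSketch {
    static void addRowEncodedSink(StreamExecutionEnvironment env, DataStream<String> stream) {
        // Part files are only finalized on successful checkpoints in STREAMING mode.
        env.enableCheckpointing(60_000);

        FileSink<String> sink = FileSink
                .forRowFormat(new Path("/tmp/row-out"), new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withRolloverInterval(Duration.ofMinutes(15))  // roll at least every 15 minutes
                                .withInactivityInterval(Duration.ofMinutes(5)) // roll after 5 minutes without records
                                .withMaxPartSize(MemorySize.ofMebiBytes(128))  // roll once a part file reaches 128 MiB
                                .build())
                .build();

        stream.sinkTo(sink);
    }
}
```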
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -811,61 +809,63 @@ input.sinkTo(sink)
{{< /tab >}}
{{< /tabs >}}
-The `SequenceFileWriterFactory` supports additional constructor parameters to specify compression settings.
+`SequenceFileWriterFactory` 提供额外的构造参数去设置是否开启压缩功能。
+
+<a name="bucket-assignment"></a>
+
+### 桶分配
-### Bucket Assignment
+桶的逻辑定义了如何将数据分配到基本输出目录内的子目录中。
-The bucketing logic defines how the data will be structured into subdirectories inside the base output directory.
+Row-encoded Format 和 Bulk-encoded Format (参考 [Format Types](#sink-format-types)) 使用了 `DateTimeBucketAssigner` 作为默认的分配器。
+默认的分配器 `DateTimeBucketAssigner` 会基于使用了格式为 `yyyy-MM-dd--HH` 的系统默认时区来创建小时桶。日期格式( *即* 桶大小)和时区都可以手动配置。
-Both row and bulk formats (see [File Formats](#file-formats)) use the `DateTimeBucketAssigner` as the default assigner.
-By default the `DateTimeBucketAssigner` creates hourly buckets based on the system default timezone
-with the following format: `yyyy-MM-dd--HH`. Both the date format (*i.e.* bucket size) and timezone can be
-configured manually.
+我们可以在格式化构造器中通过调用 `.withBucketAssigner(assigner)` 方法去指定自定义的 `BucketAssigner`。
-We can specify a custom `BucketAssigner` by calling `.withBucketAssigner(assigner)` on the format builders.
+Flink 内置了两种 BucketAssigners:
-Flink comes with two built-in BucketAssigners:
+- `DateTimeBucketAssigner` : 默认的基于时间的分配器
+- `BasePathBucketAssigner` : 分配所有文件存储在基础路径上(单个全局桶)
-- `DateTimeBucketAssigner` : Default time based assigner
-- `BasePathBucketAssigner` : Assigner that stores all part files in the base path (single global bucket)
+<a name="rolling-policy"></a>
-### Rolling Policy
+### 滚动策略
-The `RollingPolicy` defines when a given in-progress part file will be closed and moved to the pending and later to finished state.
-Part files in the "finished" state are the ones that are ready for viewing and are guaranteed to contain valid data that will not be reverted in case of failure.
-In `STREAMING` mode, the Rolling Policy in combination with the checkpointing interval (pending files become finished on the next checkpoint) control how quickly
-part files become available for downstream readers and also the size and number of these parts. In `BATCH` mode, part-files become visible at the end of the job but
-the rolling policy can control their maximum size.
+`RollingPolicy` 定义了何时关闭给定的进行中的文件,并将其转换为挂起状态,然后在转换为完成状态。
+完成状态的文件,可供查看并且可以保证数据的有效性,在出现故障时不会恢复。
+在 `STREAMING` 模式下,滚动策略结合 Checkpoint 间隔(到下一个 Checkpoint 完成时挂起状态的文件变成完成状态)共同控制 Part 文件对下游 readers 是否可见以及这些文件的大小和数量。在 `BATCH` 模式下,Part 文件在 Job 最后对下游才变得可见,滚动策略只控制最大的 Part 文件大小。
-Flink comes with two built-in RollingPolicies:
+Flink 内置了两种 RollingPolicies:
- `DefaultRollingPolicy`
- `OnCheckpointRollingPolicy`
-### Part file lifecycle
+<a name="part-file-lifecycle"></a>
-In order to use the output of the `FileSink` in downstream systems, we need to understand the naming and lifecycle of the output files produced.
+### Part 文件生命周期
-Part files can be in one of three states:
-1. **In-progress** : The part file that is currently being written to is in-progress
-2. **Pending** : Closed (due to the specified rolling policy) in-progress files that are waiting to be committed
-3. **Finished** : On successful checkpoints (`STREAMING`) or at the end of input (`BATCH`) pending files transition to "Finished"
+为了在下游使用 `FileSink` 作为输出,我们需要了解生成的输出文件的命名和生命周期。
+Part 文件可以处于以下三种状态中的任意一种:
+1. **In-progress** :当前正在被写入的 Part 文件处于 in-progress 状态
+2. **Pending** : (由于指定的滚动策略)关闭 in-progress 状态的文件,并且等待提交
+3. **Finished** : 流模式(`STREAMING`)下的成功的 Checkpoint 或者批模式(`BATCH`)下输入结束,挂起状态文件转换为完成状态
-Each writer subtask will have a single in-progress part file at any given time for every active bucket, but there can be several pending and finished files.
+只有完成状态下的文件被下游读取时才是安全的,并且保证不会被修改。
-**Part file example**
+对于每个活动的桶,在任何给定时间每个写入 Subtask 中都有一个正在进行的 Part 文件,但可能有多个挂起和完成的文件。
-To better understand the lifecycle of these files let's look at a simple example with 2 sink subtasks:
+**Part 文件示例**
Review comment:
link tag ?
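The hunk above covers bucket assignment and rolling policies; a minimal Java sketch of overriding the default hourly `DateTimeBucketAssigner` (the daily format, UTC zone and base path are illustrative assumptions):

```java
import java.time.ZoneId;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class BucketAssignerSketch {
    // Buckets by day in UTC instead of the default hourly buckets in the system time zone.
    static FileSink<String> dailyBucketedSink(String basePath) {
        return FileSink
                .forRowFormat(new Path(basePath), new SimpleStringEncoder<String>("UTF-8"))
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd", ZoneId.of("UTC")))
                .build();
    }
}
```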
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -874,7 +874,7 @@ When the part file `part-81fc4980-a6af-41c8-9937-9939408a734b-0` is rolled (let'
└── part-81fc4980-a6af-41c8-9937-9939408a734b-1.inprogress.bc279efe-b16f-47d8-b828-00ef6e2fbd11
```
-As `part-81fc4980-a6af-41c8-9937-9939408a734b-0` is now pending completion, after the next successful checkpoint, it is finalized:
+`part-81fc4980-a6af-41c8-9937-9939408a734b-0` 现在是挂起状态,并且在下一个 Checkpoint 成功过后,它就确定了:
Review comment:
```
`part-81fc4980-a6af-41c8-9937-9939408a734b-0` 现在是挂起状态,并且在下一个 Checkpoint 成功过后,它即成为完成状态:
```
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -757,13 +753,15 @@ class PersonVectorizer(schema: String) extends Vectorizer[Person](schema) {
{{< /tab >}}
{{< /tabs >}}
-##### Hadoop SequenceFile format
+<a name="hadoop-sequencefile-format"></a>
-To use the `SequenceFile` bulk encoder in your application you need to add the following dependency:
+##### Hadoop SequenceFile Format
+
+在你的应用程序中使用 `SequenceFile` 的 Bulk-encoded Format,你需要添加如下依赖到项目中:
Review comment:
```suggestion
在你的应用程序中使用 `SequenceFile` 的 Bulk-encoded Format,需要添加如下依赖到项目中:
```
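For the SequenceFile passage above, a minimal Java sketch of wiring `SequenceFileWriterFactory` into a bulk-encoded sink (the key/value types, codec name and base path are illustrative assumptions):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.sequencefile.SequenceFileWriterFactory;
import org.apache.flink.runtime.util.HadoopUtils;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SequenceFileSinkSketch {
    static void addSequenceFileSink(DataStream<Tuple2<LongWritable, Text>> stream, String basePath) {
        Configuration hadoopConf =
                HadoopUtils.getHadoopConfiguration(GlobalConfiguration.loadConfiguration());

        FileSink<Tuple2<LongWritable, Text>> sink = FileSink
                .forBulkFormat(
                        new Path(basePath),
                        // the extra constructor argument selects a compression codec
                        new SequenceFileWriterFactory<>(hadoopConf, LongWritable.class, Text.class, "BZip2"))
                .build();

        stream.sinkTo(sink);
    }
}
```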
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -710,18 +707,17 @@ val conf: Configuration = ...
val writerProperties: Properties = new Properties()
writerProperties.setProperty("orc.compress", "LZ4")
-// Other ORC supported properties can also be set similarly.
+// 其他 ORC 参数也可以使用类似方式进行设置
val writerFactory = new OrcBulkWriterFactory(
new PersonVectorizer(schema), writerProperties, conf)
```
{{< /tab >}}
{{< /tabs >}}
-The complete list of ORC writer properties can be found [here](https://orc.apache.org/docs/hive-config.html).
+完整的 ORC 输出属性列表可以参考 [这篇文档](https://orc.apache.org/docs/hive-config.html) 。
Review comment:
```
完整的 ORC 输出属性列表可以参考 [此文档](https://orc.apache.org/docs/hive-config.html) 。
```
or
```
完整的 ORC 输出属性列表可以参考 [此链接](https://orc.apache.org/docs/hive-config.html) 。
```
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
-Users who want to add user metadata to the ORC files can do so by calling `addUserMetadata(...)` inside the overriding
-`vectorize(...)` method.
+用户希望添加他们自己的元数据到 ORC 文件中,可以调用 `addUserMetadata(...)` 方法来覆盖 `vectorize(...)` 方法的属性设置。
Review comment:
```suggestion
用户在重写 `vectorize(...)` 方法时可以调用 `addUserMetadata(...)` 方法来添加自己的元数据到 ORC 文件中。
```
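To illustrate the wording under discussion, a minimal Java sketch of a `Vectorizer` that maps fields to `ColumnVector`s and calls `addUserMetadata(...)` from inside the overridden `vectorize(...)`; the `Person` type with `getName()`/`getAge()`, the column order and the metadata key are illustrative assumptions:

```java
import java.io.IOException;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

public class PersonVectorizer extends Vectorizer<Person> implements Serializable {

    public PersonVectorizer(String schema) {
        super(schema);
    }

    @Override
    public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
        BytesColumnVector nameColVector = (BytesColumnVector) batch.cols[0];
        LongColumnVector ageColVector = (LongColumnVector) batch.cols[1];

        // Convert the incoming element into the next row of the batch.
        int row = batch.size++;
        nameColVector.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
        ageColVector.vector[row] = element.getAge();

        // User metadata is attached from inside the overridden vectorize(...) method.
        this.addUserMetadata("author", ByteBuffer.wrap("flink".getBytes(StandardCharsets.UTF_8)));
    }
}
```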
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
+为了更好的了解这些文件的生命周期,让我们看一个只有2个 Sink Subtask 的简单例子:
Review comment:
```suggestion
为了更好的了解这些文件的生命周期,让我们看一个只有 2 个 Sink Subtask 的简单例子:
```
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -894,21 +894,22 @@ New buckets are created as dictated by the bucketing policy, and this doesn't af
└── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.2b475fec-1482-4dea-9946-eb4353b475f1
```
-Old buckets can still receive new records as the bucketing policy is evaluated on a per-record basis.
+旧桶仍然可以接收新记录,因为桶策略是在每条记录上进行评估的。
Review comment:
```
旧桶仍然可以接收新记录,因为桶策略是在每条记录上进行使用的。
```
a minor comment.
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -469,17 +471,17 @@ input.sinkTo(sink)
{{< /tab >}}
{{< /tabs >}}
-##### Avro format
+<a name="avro-format"></a>
-Flink also provides built-in support for writing data into Avro files. A list of convenience methods to create
-Avro writer factories and their associated documentation can be found in the
-AvroWriters class.
+##### Avro Format
-To use the Avro writers in your application you need to add the following dependency:
+Flink 也支持写入数据到 Avro Format 文件。在 AvroWriters 类中可以发现一系列创建 Avro writer 工厂的便利方法及其相关说明。
+
+在你的应用程序中可以使用 AvroWriters,需要添加如下依赖到项目中:
Review comment:
```suggestion
在你的应用程序中使用 AvroWriters,需要添加如下依赖到项目中:
```
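For the AvroWriters passage above, a minimal Java sketch of plugging one of those factories into a bulk-encoded sink; `Address` stands in for any Avro-generated specific-record class and is an illustrative assumption:

```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;

public class AvroSinkSketch {
    static void addAvroSink(DataStream<Address> stream, String basePath) {
        FileSink<Address> sink = FileSink
                .forBulkFormat(new Path(basePath), AvroWriters.forSpecificRecord(Address.class))
                .build();
        stream.sinkTo(sink);
    }
}
```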
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
@@ -563,20 +564,17 @@ stream.sinkTo(FileSink.forBulkFormat(
{{< /tab >}}
{{< /tabs >}}
+<a name="orc-format"></a>
+
##### ORC Format
-To enable the data to be bulk encoded in ORC format, Flink offers `OrcBulkWriterFactory`
-which takes a concrete implementation of Vectorizer.
+ORC Format 的数据采用 Bulk-encoded Format,Flink 提供了 Vectorizer 接口的具体实现类 `OrcBulkWriterFactory`。
-Like any other columnar format that encodes data in bulk fashion, Flink's `OrcBulkWriter` writes the input elements in batches. It uses
-ORC's `VectorizedRowBatch` to achieve this.
+像其他列格式一样也是采用 Bulk-encoded Format,Flink 中的 `OrcBulkWriter` 以批的方式写出数据。是使用 ORC 的 `VectorizedRowBatch` 实现的。
-Since the input element has to be transformed to a `VectorizedRowBatch`, users have to extend the abstract `Vectorizer`
-class and override the `vectorize(T element, VectorizedRowBatch batch)` method. As you can see, the method provides an
-instance of `VectorizedRowBatch` to be used directly by the users so users just have to write the logic to transform the
-input `element` to `ColumnVectors` and set them in the provided `VectorizedRowBatch` instance.
+一旦输入数据已经被转换成了 `VectorizedRowBatch`,用户就必须继承抽象类 `Vectorizer` 并且覆写类中 `vectorize(T element, VectorizedRowBatch batch)` 这个方法。正如你所看到的那样,此方法中提供了被用户直接使用的 `VectorizedRowBatch` 类的实例,因此,用户不得不编写从输入 `element` 到 `ColumnVectors` 的转换逻辑,并将它们设置在 `VectorizedRowBatch` 实例中。
Review comment:
```
由于输入数据已经被转换成了 `VectorizedRowBatch`,所以用户必须继承抽象类 `Vectorizer` 并且覆写类中 `vectorize(T element, VectorizedRowBatch batch)` 这个方法。正如你所看到的那样,此方法中提供了被用户直接使用的 `VectorizedRowBatch` 类的实例,因此,用户不得不编写从输入 `element` 到 `ColumnVectors` 的转换逻辑,并将它们设置在 `VectorizedRowBatch` 实例中。
```
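Building on the Vectorizer discussion in this hunk, a minimal Java sketch of handing such a `Vectorizer` to `OrcBulkWriterFactory` and a bulk-encoded sink; the `PersonVectorizer`, ORC schema string and base path are illustrative assumptions, and the property-based constructor mirrors the Scala snippet quoted earlier in the thread:

```java
import java.util.Properties;

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.hadoop.conf.Configuration;

public class OrcSinkSketch {
    static FileSink<Person> orcSink(String basePath) {
        String schema = "struct<_col0:string,_col1:int>";

        Properties writerProperties = new Properties();
        writerProperties.setProperty("orc.compress", "LZ4"); // other ORC properties can be set the same way

        OrcBulkWriterFactory<Person> factory = new OrcBulkWriterFactory<>(
                new PersonVectorizer(schema), writerProperties, new Configuration());

        return FileSink
                .forBulkFormat(new Path(basePath), factory)
                .build();
    }
}
```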
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
+3. **Finished** : 流模式(`STREAMING`)下的成功的 Checkpoint 或者批模式(`BATCH`)下输入结束,挂起状态文件转换为完成状态
Review comment:
```suggestion
3. **Finished** : 流模式(`STREAMING`)下的成功的 Checkpoint 或者批模式(`BATCH`)下输入结束,pending 文件转换为“Finished”状态
```
##########
File path: docs/content.zh/docs/connectors/datastream/filesystem.md
##########
+3. **Finished** : 流模式(`STREAMING`)下的成功的 Checkpoint 或者批模式(`BATCH`)下输入结束,挂起状态文件转换为完成状态
Review comment:
Please keep `In-progress`, `Pending`, and `Finished` consistent with the original terms in the rest of the page.