JingGe commented on a change in pull request #18646:
URL: https://github.com/apache/flink/pull/18646#discussion_r828939974
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
Review comment:
```suggestion
- 有界数据:列出所有文件并全部读取。
```
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
+- 流处理模式下的连续读取:监控目录中出现的新文件
Review comment:
```suggestion
- 无界数据:监控目录中出现的新文件
```
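(Editor's note: the bounded/unbounded split that the two suggestions above translate is just a builder option on `FileSource`. A minimal runnable sketch follows; the schema, the `/tmp/parquet` path, and the 5 ms interval are illustrative assumptions that mirror the snippets quoted later in this diff.)
```java
import java.time.Duration;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;

// Any schema works here; this minimal one only makes the sketch self-contained.
Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"namespace\": \"example.avro\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");

// Bounded (the default): list all files under the path once and read them all.
FileSource<GenericRecord> bounded =
        FileSource.forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema), new Path("/tmp/parquet"))
                .build();

// Unbounded: additionally monitor the path for new files every 5 ms.
FileSource<GenericRecord> unbounded =
        FileSource.forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema), new Path("/tmp/parquet"))
                .monitorContinuously(Duration.ofMillis(5L))
                .build();
```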
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -218,17 +217,17 @@ The following example uses the example schema [testdata.avsc](https://github.com
]
```
-You will use the Avro Maven plugin to generate the `Address` Java class:
+你可以使用 Avro Maven plugin 生成 `Address` Java 类。
```java
@org.apache.avro.specific.AvroGenerated
public class Address extends org.apache.avro.specific.SpecificRecordBase
implements org.apache.avro.specific.SpecificRecord {
- // generated code...
+ // 生成的代码...
}
```
-You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Specific record
-and then create a DataStream containing Parquet records as Avro Specific records.
+你可以通过 `AvroParquetReaders` 为 Avro Specific 记录创建 `AvroParquetRecordFormat`,
+然后创建一个包含 Parquet 记录作为 Avro Specific 记录的 DateStream。
Review comment:
```suggestion
然后创建一个包含由 Avro Specific records 格式构成的 Parquet records 的 DataStream。
```
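(Editor's note: a runnable sketch of the Specific-record flow discussed here, assuming the generated `Address` class from the hunk above and a hypothetical input path.)
```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// AvroParquetReaders.forSpecificRecord(...) yields a StreamFormat for the generated class.
FileSource<Address> source =
        FileSource.forRecordStreamFormat(
                        AvroParquetReaders.forSpecificRecord(Address.class),
                        new Path("/tmp/parquet"))  // hypothetical path
                .build();

// No watermarks: the records carry no event timestamps.
DataStream<Address> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```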
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
+- 流处理模式下的连续读取:监控目录中出现的新文件
{{< hint info >}}
-When you start a File Source it is configured for bounded data by default.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+当你开启一个 File Source,会被默认为有界读取。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
**Vectorized reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forBulkFileFormat(BulkFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forBulkFileFormat(BulkFormat,Path...)
- .monitorContinuously(Duration.ofMillis(5L))
- .build();
-
+.monitorContinuously(Duration.ofMillis(5L))
+.build();
```
**Avro Parquet reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forRecordStreamFormat(StreamFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forRecordStreamFormat(StreamFormat,Path...)
.monitorContinuously(Duration.ofMillis(5L))
.build();
-
-
```
{{< hint info >}}
-Following examples are all configured for bounded data.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+下面的案例都是基于有界数据的。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
+<a name="flink-rowdata"></a>
+
## Flink RowData
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+在这个示例中,你将创建由 Parquet 格式的记录构成的 Flink RowDatas DataStream。我们把 schema 信息映射为只读字段("f7"、"f4" 和 "f99")。
Review comment:
```suggestion
在此示例中,你将创建由 Parquet 格式的记录构成的 Flink RowDatas DataStream。我们把 schema 信息映射为只读字段("f7"、"f4" 和 "f99")。
```
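(Editor's note: since this hunk's prose is what documents the constructor arguments, a labeled sketch may help. The inline comments paraphrase the English text above, and the values mirror the example code later in this diff.)
```java
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.table.types.logical.DoubleType;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

final LogicalType[] fieldTypes =
        new LogicalType[] {new DoubleType(), new IntType(), new VarCharType()};

final ParquetColumnarRowInputFormat<FileSourceSplit> format =
        new ParquetColumnarRowInputFormat<>(
                new Configuration(),                                      // Hadoop configuration
                RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}), // projected read schema
                500,    // batch size: rows decoded per batch
                false,  // first boolean: interpret timestamp columns as UTC
                true);  // second boolean: projected Parquet field names are case-sensitive
```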
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
+- 流处理模式下的连续读取:监控目录中出现的新文件
{{< hint info >}}
-When you start a File Source it is configured for bounded data by default.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+当你开启一个 File Source,会被默认为有界读取。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
**Vectorized reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forBulkFileFormat(BulkFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forBulkFileFormat(BulkFormat,Path...)
- .monitorContinuously(Duration.ofMillis(5L))
- .build();
-
+.monitorContinuously(Duration.ofMillis(5L))
+.build();
```
**Avro Parquet reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forRecordStreamFormat(StreamFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forRecordStreamFormat(StreamFormat,Path...)
.monitorContinuously(Duration.ofMillis(5L))
.build();
-
-
```
{{< hint info >}}
-Following examples are all configured for bounded data.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+下面的案例都是基于有界数据的。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
+<a name="flink-rowdata"></a>
+
## Flink RowData
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+在这个示例中,你将创建由 Parquet 格式的记录构成的 Flink RowDatas DataStream。我们把 schema 信息映射为只读字段("f7"、"f4" 和 "f99")。
+每个批次读取 500 条记录。其中,第一个布尔类型的参数用来指定是否需要将时间戳列处理为 UTC。
+第二个布尔类型参数用来指定在进行 Parquet 字段映射时,是否要区分大小写。
+这里不需要水印策略,因为记录中不包含事件时间戳。
+
```java
final LogicalType[] fieldTypes =
- new LogicalType[] {
- new DoubleType(), new IntType(), new VarCharType()
- };
+ new LogicalType[] {
+ new DoubleType(), new IntType(), new VarCharType()
+ };
final ParquetColumnarRowInputFormat<FileSourceSplit> format =
- new ParquetColumnarRowInputFormat<>(
- new Configuration(),
- RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
- 500,
- false,
- true);
-final FileSource<RowData> source =
+ new ParquetColumnarRowInputFormat<>(
+ new Configuration(),
+ RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
+ 500,
+ false,
+ true);
+
+final FileSource<RowData> source
FileSource.forBulkFileFormat(format, /* Flink Path */)
.build();
+
final DataStream<RowData> stream =
- env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```
## Avro Records
-Flink supports producing three types of Avro records by reading Parquet files:
+Flink 支持三种通过读取 Parquet 文件创建 Avro records 的方式:
- [Generic record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
- [Specific record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
- [Reflect record](https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/reflect/package-summary.html)
### Generic record
-Avro schemas are defined using JSON. You can get more information about Avro schemas and types from the [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html).
-This example uses an Avro schema example similar to the one described in the [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html):
+使用 JSON 定义 Avro schemas。你可以从 [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html) 获取更多关于 Avro schemas 和类型的信息。
+这个示例使用了 [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html) 中描述的示例相似的 Avro schema:
```json lines
{"namespace": "example.avro",
- "type": "record",
- "name": "User",
- "fields": [
- {"name": "name", "type": "string"},
- {"name": "favoriteNumber", "type": ["int", "null"]},
- {"name": "favoriteColor", "type": ["string", "null"]}
- ]
+ "type": "record",
+ "name": "User",
+ "fields": [
+ {"name": "name", "type": "string"},
+ {"name": "favoriteNumber", "type": ["int", "null"]},
+ {"name": "favoriteColor", "type": ["string", "null"]}
+ ]
}
```
+这个 schema 定义了一个具有三个属性的的 user 记录:name,favoriteNumber 和 favoriteColor。你可以
+在 [record specification](https://avro.apache.org/docs/1.10.0/spec.html#schema_record) 找到更多关于如何定义 Avro schema 的详细信息。
-This schema defines a record representing a user with three fields: name, favoriteNumber, and favoriteColor. You can find more details at [record specification](https://avro.apache.org/docs/1.10.0/spec.html#schema_record) for how to define an Avro schema.
-
-In the following example, you will create a DataStream containing Parquet records as Avro Generic records.
-It will parse the Avro schema based on the JSON string. There are many other ways to parse a schema, e.g. from java.io.File or java.io.InputStream. Please refer to [Avro Schema](https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/Schema.html) for details.
-After that, you will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Generic records.
+在这个示例中,你将创建由 Parquet 格式的记录构成的 Avro Generic 记录的 DataStream。
Review comment:
```suggestion
在此示例中,你将创建包含由 Avro Generic records 格式构成的 Parquet records 的 DataStream。
```
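(Editor's note: a runnable sketch of the Generic-record flow, assuming a hypothetical input path. The schema string is the `User` schema from this hunk; parsing from `java.io.File` or `java.io.InputStream` would work equally well.)
```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Parse the user schema from its JSON definition.
Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"example.avro\","
        + " \"type\": \"record\","
        + " \"name\": \"User\","
        + " \"fields\": ["
        + "   {\"name\": \"name\", \"type\": \"string\"},"
        + "   {\"name\": \"favoriteNumber\", \"type\": [\"int\", \"null\"]},"
        + "   {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"]}"
        + " ]}");

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

FileSource<GenericRecord> source =
        FileSource.forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema),
                        new Path("/tmp/parquet"))  // hypothetical path
                .build();

DataStream<GenericRecord> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```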
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -287,8 +286,8 @@ public class Datum implements Serializable {
}
```
-You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Reflect record
-and then create a DataStream containing Parquet records as Avro Reflect records.
+你可以通过 `AvroParquetReaders` 为 Avro Reflect 记录创建一个 `AvroParquetRecordFormat`,
+然后创建一个包含 Parquet 记录作为 Avro Reflect 记录的 DataStream。
Review comment:
```suggestion
然后创建一个包含由 Avro Reflect records 格式构成的 Parquet records 的 DataStream。
```
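(Editor's note: correspondingly, a minimal sketch of the Reflect-record flow, assuming the `Datum` POJO from the hunk above and a hypothetical input path.)
```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// The reflect reader derives the Avro schema from the POJO class at runtime.
FileSource<Datum> source =
        FileSource.forRecordStreamFormat(
                        AvroParquetReaders.forReflectRecord(Datum.class),
                        new Path("/tmp/parquet"))  // hypothetical path
                .build();

DataStream<Datum> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```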
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -349,45 +345,44 @@ favoriteColor: BINARY UNCOMPRESSED DO:0 FPO:92 SZ:55/55/1.00 VC:3 ENC:RLE,PLAI
```
-With the `User` class defined in the package org.apache.flink.formats.parquet.avro:
+包路径 `org.apache.flink.formats.parquet.avro` 下创建的对象如下:
Review comment:
```suggestion
使用包 `org.apache.flink.formats.parquet.avro` 路径下已定义的 User 类:
```
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -303,14 +302,12 @@ final DataStream<GenericRecord> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```
-#### Prerequisite for Parquet files
+### 使用 Parquet files 必备条件
-In order to support reading Avro reflect records, the Parquet file must contain specific meta information.
-The Avro schema used for creating the Parquet data must contain a `namespace`,
-which will be used by the program to identify the concrete Java class for the reflection process.
+为了支持读取 Avro Reflect 数据,Parquet 文件必须包含特定的 meta 信息。为了生成 Parquet 数据,Avro schema 信息中必须包含 namespace,
+以便让程序在反射执行过程中能确定唯一的 Java Class 对象。
-The following example shows the `User` schema used previously. But this time it contains a namespace
-pointing to the location(in this case the package), where the `User` class for the reflection could be found.
+下面的案例展示了上文中的 User 对象的 schema 信息。但是当前案例包含了一个指定文件目录的 namespace(当前案例下的包路径),反射过程中可以找到对应的 User 文件。
Review comment:
```suggestion
下面的案例展示了上文中的 User 对象的 schema 信息。但是当前案例包含了一个指定文件目录的 namespace(当前案例下的包路径),反射过程中可以找到对应的 User 类。
```
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -349,45 +345,44 @@ favoriteColor: BINARY UNCOMPRESSED DO:0 FPO:92 SZ:55/55/1.00 VC:3 ENC:RLE,PLAI
```
-With the `User` class defined in the package org.apache.flink.formats.parquet.avro:
+包路径 `org.apache.flink.formats.parquet.avro` 下创建的对象如下:
```java
public class User {
- private String name;
- private Integer favoriteNumber;
- private String favoriteColor;
+ private String name;
+ private Integer favoriteNumber;
+ private String favoriteColor;
- public User() {}
+ public User() {}
-    public User(String name, Integer favoriteNumber, String favoriteColor) {
- this.name = name;
- this.favoriteNumber = favoriteNumber;
- this.favoriteColor = favoriteColor;
- }
+ public User(String name, Integer favoriteNumber, String favoriteColor) {
+ this.name = name;
+ this.favoriteNumber = favoriteNumber;
+ this.favoriteColor = favoriteColor;
+ }
- public String getName() {
- return name;
- }
+ public String getName() {
+ return name;
+ }
- public Integer getFavoriteNumber() {
- return favoriteNumber;
- }
+ public Integer getFavoriteNumber() {
+ return favoriteNumber;
+ }
- public String getFavoriteColor() {
- return favoriteColor;
- }
+ public String getFavoriteColor() {
+ return favoriteColor;
}
+}
```
-you can write the following program to read Avro Reflect records of User type from parquet files:
+你可以通过下面的程序读取类型为 User 的 Avro Reflect 数据:
Review comment:
```suggestion
你可以通过下面的程序读取类型为 User 的 Avro Reflect records:
```
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
+- 流处理模式下的连续读取:监控目录中出现的新文件
{{< hint info >}}
-When you start a File Source it is configured for bounded data by default.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+当你开启一个 File Source,会被默认为有界读取。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
**Vectorized reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forBulkFileFormat(BulkFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forBulkFileFormat(BulkFormat,Path...)
- .monitorContinuously(Duration.ofMillis(5L))
- .build();
-
+.monitorContinuously(Duration.ofMillis(5L))
+.build();
```
**Avro Parquet reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forRecordStreamFormat(StreamFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forRecordStreamFormat(StreamFormat,Path...)
.monitorContinuously(Duration.ofMillis(5L))
.build();
-
-
```
{{< hint info >}}
-Following examples are all configured for bounded data.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+下面的案例都是基于有界数据的。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
+<a name="flink-rowdata"></a>
+
## Flink RowData
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+在这个示例中,你将创建由 Parquet 格式的记录构成的 Flink RowDatas DataStream。我们把 schema 信息映射为只读字段("f7"、"f4" 和 "f99")。
+每个批次读取 500 条记录。其中,第一个布尔类型的参数用来指定是否需要将时间戳列处理为 UTC。
+第二个布尔类型参数用来指定在进行 Parquet 字段映射时,是否要区分大小写。
+这里不需要水印策略,因为记录中不包含事件时间戳。
+
```java
final LogicalType[] fieldTypes =
- new LogicalType[] {
- new DoubleType(), new IntType(), new VarCharType()
- };
+ new LogicalType[] {
+ new DoubleType(), new IntType(), new VarCharType()
+ };
final ParquetColumnarRowInputFormat<FileSourceSplit> format =
- new ParquetColumnarRowInputFormat<>(
- new Configuration(),
- RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
- 500,
- false,
- true);
-final FileSource<RowData> source =
+ new ParquetColumnarRowInputFormat<>(
+ new Configuration(),
+ RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
+ 500,
+ false,
+ true);
+
+final FileSource<RowData> source
FileSource.forBulkFileFormat(format, /* Flink Path */)
.build();
+
final DataStream<RowData> stream =
- env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```
## Avro Records
-Flink supports producing three types of Avro records by reading Parquet files:
+Flink 支持三种通过读取 Parquet 文件创建 Avro records 的方式:
- [Generic record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
- [Specific record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
- [Reflect record](https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/reflect/package-summary.html)
### Generic record
-Avro schemas are defined using JSON. You can get more information about Avro schemas and types from the [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html).
-This example uses an Avro schema example similar to the one described in the [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html):
+使用 JSON 定义 Avro schemas。你可以从 [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html) 获取更多关于 Avro schemas 和类型的信息。
+这个示例使用了 [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html) 中描述的示例相似的 Avro schema:
Review comment:
```suggestion
此示例使用了一个与 [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html) 中描述的示例相似的 Avro schema:
```
##########
File path: docs/content.zh/docs/connectors/datastream/formats/parquet.md
##########
@@ -61,146 +62,144 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
</dependency>
```
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+此格式与新的 Source 兼容,可以同时在批和流模式下使用。
+因此,你可使用此格式处理以下两类数据:
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- 批处理模式下的有界读取
+- 流处理模式下的连续读取:监控目录中出现的新文件
{{< hint info >}}
-When you start a File Source it is configured for bounded data by default.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+当你开启一个 File Source,会被默认为有界读取。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
**Vectorized reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forBulkFileFormat(BulkFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forBulkFileFormat(BulkFormat,Path...)
- .monitorContinuously(Duration.ofMillis(5L))
- .build();
-
+.monitorContinuously(Duration.ofMillis(5L))
+.build();
```
**Avro Parquet reader**
```java
-
// Parquet rows are decoded in batches
FileSource.forRecordStreamFormat(StreamFormat,Path...)
-
// Monitor the Paths to read data as unbounded data
FileSource.forRecordStreamFormat(StreamFormat,Path...)
.monitorContinuously(Duration.ofMillis(5L))
.build();
-
-
```
{{< hint info >}}
-Following examples are all configured for bounded data.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+下面的案例都是基于有界数据的。
+如果你想在连续读取模式下使用 File Source,你必须额外调用
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`。
{{< /hint >}}
+<a name="flink-rowdata"></a>
+
## Flink RowData
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+在这个示例中,你将创建由 Parquet 格式的记录构成的 Flink RowDatas DataStream。我们把 schema 信息映射为只读字段("f7"、"f4" 和 "f99")。
+每个批次读取 500 条记录。其中,第一个布尔类型的参数用来指定是否需要将时间戳列处理为 UTC。
+第二个布尔类型参数用来指定在进行 Parquet 字段映射时,是否要区分大小写。
+这里不需要水印策略,因为记录中不包含事件时间戳。
+
```java
final LogicalType[] fieldTypes =
- new LogicalType[] {
- new DoubleType(), new IntType(), new VarCharType()
- };
+ new LogicalType[] {
+ new DoubleType(), new IntType(), new VarCharType()
+ };
final ParquetColumnarRowInputFormat<FileSourceSplit> format =
- new ParquetColumnarRowInputFormat<>(
- new Configuration(),
- RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
- 500,
- false,
- true);
-final FileSource<RowData> source =
+ new ParquetColumnarRowInputFormat<>(
+ new Configuration(),
+ RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
+ 500,
+ false,
+ true);
+
+final FileSource<RowData> source
FileSource.forBulkFileFormat(format, /* Flink Path */)
.build();
+
final DataStream<RowData> stream =
- env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```
## Avro Records
-Flink supports producing three types of Avro records by reading Parquet files:
+Flink 支持三种通过读取 Parquet 文件创建 Avro records 的方式:
Review comment:
```suggestion
Flink 支持三种方式来读取 Parquet 文件并创建 Avro records:
```
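(Editor's note: the three flavors map onto three factory methods of `org.apache.flink.formats.parquet.avro.AvroParquetReaders`. A compact sketch follows; `Address` and `Datum` are the classes quoted earlier in this review, and `schema` is an `org.apache.avro.Schema` parsed as in the Generic record example.)
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.src.reader.StreamFormat;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;

// Generic record: the Avro schema is supplied explicitly at runtime.
StreamFormat<GenericRecord> generic = AvroParquetReaders.forGenericRecord(schema);

// Specific record: a class generated from the .avsc schema by the Avro compiler.
StreamFormat<Address> specific = AvroParquetReaders.forSpecificRecord(Address.class);

// Reflect record: a plain Java class; the schema is derived via Avro reflection.
StreamFormat<Datum> reflect = AvroParquetReaders.forReflectRecord(Datum.class);
```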
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]