luoyuxia commented on code in PR #20227:
URL: https://github.com/apache/flink/pull/20227#discussion_r919570445
##########
docs/content.zh/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -102,124 +91,119 @@ FROM hive_table
```
-**Notes**
+**注意**
-- Monitor strategy is to scan all directories/files currently in the location
path. Many partitions may cause performance degradation.
-- Streaming reads for non-partitioned tables requires that each file be
written atomically into the target directory.
-- Streaming reading for partitioned tables requires that each partition should
be added atomically in the view of hive metastore. If not, new data added to an
existing partition will be consumed.
-- Streaming reads do not support watermark grammar in Flink DDL. These tables
cannot be used for window operators.
+- 监控策略是扫描当前位置路径中的所有目录/文件,分区太多可能导致性能下降。
+- 流读非分区表时要求每个文件应原子地写入目标目录。
+- 流读分区表要求每个分区应该被原子地添加进 Hive metastore 中。否则,添加到已存在分区的新数据会被消费。
+- 流读 Hive 表不支持 Flink DDL 的 watermark 语法。这些表不能被用于窗口算子。
-### Reading Hive Views
+### 读取 Hive Views
-Flink is able to read from Hive defined views, but some limitations apply:
+Flink 能够读取 Hive 中已经定义的视图。但是也有一些限制:
-1) The Hive catalog must be set as the current catalog before you can query
the view.
-This can be done by either `tableEnv.useCatalog(...)` in Table API or `USE
CATALOG ...` in SQL Client.
+1) Hive catalog 必须设置成当前的 catalog 才能查询视图。在 Table API 中使用
`tableEnv.useCatalog(...)`,或者在 SQL 客户端使用 `USE CATALOG ...` 来改变当前 catalog。
-2) Hive and Flink SQL have different syntax, e.g. different reserved keywords
and literals.
-Make sure the view’s query is compatible with Flink grammar.
+2) Hive 和 Flink SQL 的语法不同, 比如不同的关键字和字面值。确保查询视图与 Flink 语法兼容。
Review Comment:
```suggestion
2) Hive 和 Flink SQL 的语法不同, 比如不同的关键字和字面值。请确保对视图的查询语法与 Flink 语法兼容。
```
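For context, the two constraints above can be sketched as follows; `myhive` and `my_hive_view` are placeholder names for illustration, not identifiers from this patch:

```sql
-- 1) Make the Hive catalog the current catalog (SQL Client syntax);
--    in the Table API this corresponds to tableEnv.useCatalog("myhive").
USE CATALOG myhive;

-- 2) Query the Hive-defined view; this only works if the view's
--    underlying query is also valid Flink SQL (keywords, literals, ...).
SELECT * FROM my_hive_view;
```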
##########
docs/content.zh/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -25,74 +25,63 @@ specific language governing permissions and limitations
under the License.
-->
-# Hive Read & Write
+# Hive 读 & 写
-Using the `HiveCatalog`, Apache Flink can be used for unified `BATCH` and
`STREAM` processing of Apache
-Hive Tables. This means Flink can be used as a more performant alternative to
Hive’s batch engine,
-or to continuously read and write data into and out of Hive tables to power
real-time data
-warehousing applications.
+通过使用 `HiveCatalog`,Apache Flink 可以对 Apache Hive 表做统一的批和流处理。这意味着 Flink 可以成为
Hive 批处理引擎的一个性能更好的选择,或者连续读写 Hive 表中的数据以支持实时数据仓库应用。
-## Reading
+## 读
-Flink supports reading data from Hive in both `BATCH` and `STREAMING` modes.
When run as a `BATCH`
-application, Flink will execute its query over the state of the table at the
point in time when the
-query is executed. `STREAMING` reads will continuously monitor the table and
incrementally fetch
-new data as it is made available. Flink will read tables as bounded by default.
+Flink 支持以批和流两种模式从 Hive 表中读取数据。批读的时候,Flink 会基于执行查询时表的状态进行查询。流读时将持续监控表,并在表中新数据可用时进行增量获取。默认情况下,Flink 会以有界的方式读取表。
-`STREAMING` reads support consuming both partitioned and non-partitioned
tables.
-For partitioned tables, Flink will monitor the generation of new partitions,
and read
-them incrementally when available. For non-partitioned tables, Flink will
monitor the generation
-of new files in the folder and read new files incrementally.
+流读支持消费分区表和非分区表。对于分区表,Flink 会监控新分区的生成,并且在数据可用的情况下增量获取数据。对于非分区表,Flink
将监控文件夹中新文件的生成,并增量地读取新文件。
<table class="table table-bordered">
<thead>
<tr>
- <th class="text-left" style="width: 20%">Key</th>
- <th class="text-left" style="width: 15%">Default</th>
- <th class="text-left" style="width: 10%">Type</th>
- <th class="text-left" style="width: 55%">Description</th>
+ <th class="text-left" style="width: 20%">键</th>
+ <th class="text-left" style="width: 15%">默认值</th>
+ <th class="text-left" style="width: 10%">类型</th>
+ <th class="text-left" style="width: 55%">描述</th>
</tr>
</thead>
<tbody>
<tr>
<td><h5>streaming-source.enable</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
- <td>Enable streaming source or not. NOTES: Please make sure that each
partition/file should be written atomically, otherwise the reader may get
incomplete data.</td>
+            <td>是否启用流读。注意:请确保每个分区/文件都是原子地写入的,否则读取到的数据可能不完整。</td>
</tr>
<tr>
<td><h5>streaming-source.partition.include</h5></td>
<td style="word-wrap: break-word;">all</td>
<td>String</td>
- <td>Option to set the partitions to read, the supported option are
`all` and `latest`, the `all` means read all partitions; the `latest` means
read latest partition in order of 'streaming-source.partition.order', the
`latest` only works` when the streaming hive source table used as temporal
table. By default the option is `all`.
- Flink supports temporal join the latest hive partition by enabling
'streaming-source.enable' and setting 'streaming-source.partition.include' to
'latest', at the same time, user can assign the partition compare order and
data update interval by configuring following partition-related options.
+            <td>选择读取的分区,可选项为 `all` 和 `latest`,`all` 表示读取所有分区;`latest` 表示读取按照 'streaming-source.partition.order' 排序后的最新分区,`latest` 仅在流模式的 Hive 源表作为时态表时有效。默认的选项是 `all`。在开启 'streaming-source.enable' 并设置 'streaming-source.partition.include' 为 'latest' 时,Flink 支持 temporal join 最新的 Hive 分区,同时,用户可以通过配置下述分区相关的选项来指定分区比较顺序和数据更新时间间隔。
</td>
</tr>
<tr>
<td><h5>streaming-source.monitor-interval</h5></td>
<td style="word-wrap: break-word;">None</td>
<td>Duration</td>
- <td>Time interval for consecutively monitoring partition/file.
- Notes: The default interval for hive streaming reading is '1 min',
the default interval for hive streaming temporal join is '60 min', this is
because there's one framework limitation that every TM will visit the Hive
metaStore in current hive streaming temporal join implementation which may
produce pressure to metaStore, this will improve in the future.</td>
+ <td>连续监控分区/文件的时间间隔。
+ 注意: 默认情况下,流式读 Hive 的间隔为 '1 min',但流读 Hive 的 temporal join 的默认时间间隔是
'60 min',这是因为当前流读 Hive 的 temporal join 实现上有一个框架限制,即每个 TM 都要访问 Hive
metaStore,这可能会对 metaStore 产生压力,这个问题将在未来得到改善。</td>
Review Comment:
```suggestion
注意: 默认情况下,流式读 Hive 的间隔为 '1 min',但流读 Hive 的 temporal join
的默认时间间隔是 '60 min',这是因为当前流读 Hive 的 temporal join 实现上有一个框架限制,即每个 TM 都要访问 Hive
metastore,这可能会对 metastore 产生压力,这个问题将在未来得到改善。</td>
```
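A hedged sketch of how the options discussed in this hunk combine for a streaming temporal join; table names, columns, and property values below are illustrative, not taken from this patch:

```sql
-- Hive-dialect DDL: a partitioned dimension table that is read as an
-- unbounded stream, always exposing only the latest partition.
SET table.sql-dialect=hive;
CREATE TABLE dim_rates (
  currency STRING,
  rate DECIMAL(10, 4)
) PARTITIONED BY (pt_day STRING) TBLPROPERTIES (
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '60 min'  -- temporal-join default
);

-- Back in the default dialect: join each order against the latest partition.
SET table.sql-dialect=default;
SELECT o.order_id, r.rate
FROM orders AS o
JOIN dim_rates FOR SYSTEM_TIME AS OF o.proc_time AS r
ON o.currency = r.currency;
```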
##########
docs/content.zh/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -25,74 +25,63 @@ specific language governing permissions and limitations
under the License.
-->
<tr>
<td><h5>streaming-source.partition-order</h5></td>
<td style="word-wrap: break-word;">partition-name</td>
<td>String</td>
- <td>The partition order of streaming source, support create-time,
partition-time and partition-name. create-time compares partition/file creation
time, this is not the partition create time in Hive metaStore, but the
folder/file modification time in filesystem, if the partition folder somehow
gets updated, e.g. add new file into folder, it can affect how the data is
consumed. partition-time compares the time extracted from partition name.
partition-name compares partition name's alphabetical order. For non-partition
table, this value should always be 'create-time'. By default the value is
partition-name. The option is equality with deprecated option
'streaming-source.consume-order'.</td>
+          <td>streaming source 的分区排序,支持 create-time、partition-time 和 partition-name。create-time 比较分区/文件的创建时间,这不是 Hive metastore 中分区的创建时间,而是文件夹/文件在文件系统中的修改时间,如果分区文件夹以某种方式被更新,比如在文件夹里新增了一个文件,就会影响数据被消费的方式。partition-time 比较从分区名称中提取的时间。partition-name 比较分区名称的字典顺序。对于非分区表,该值应始终为 'create-time'。默认值是 partition-name。该选项与已弃用的 'streaming-source.consume-order' 选项等价。</td>
</tr>
<tr>
<td><h5>streaming-source.consume-start-offset</h5></td>
<td style="word-wrap: break-word;">None</td>
<td>String</td>
- <td>Start offset for streaming consuming. How to parse and compare
offsets depends on your order. For create-time and partition-time, should be a
timestamp string (yyyy-[m]m-[d]d [hh:mm:ss]). For partition-time, will use
partition time extractor to extract time from partition.
- For partition-name, is the partition name string (e.g.
pt_year=2020/pt_mon=10/pt_day=01).</td>
+ <td>流模式起始消费偏移量。如何解析和比较偏移取决于你指定的顺序。对于 create-time 和
partition-time,会比较字符串类型的 timestamp (yyyy-[m]m-[d]d [hh:mm:ss])。对于
partition-time,将使用分区时间提取器从分区提取的时间。
Review Comment:
```suggestion
<td>流模式起始消费偏移量。如何解析和比较偏移量取决于你指定的顺序。对于 create-time 和
partition-time,会比较时间戳 (yyyy-[m]m-[d]d [hh:mm:ss])。对于
partition-time,将使用分区时间提取器从分区名字中提取的时间。
```
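The offset format described above can be exercised with a dynamic table-option hint; the table name and offset value here are made up for illustration:

```sql
-- Stream-read the table ordered by partition name, skipping every
-- partition whose name sorts before the given start offset.
SELECT *
FROM hive_table
/*+ OPTIONS(
      'streaming-source.enable' = 'true',
      'streaming-source.partition-order' = 'partition-name',
      'streaming-source.consume-start-offset' = 'pt_year=2020/pt_mon=10/pt_day=01'
    ) */;
```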
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]