This is an automated email from the ASF dual-hosted git repository.
dockerzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/inlong-website.git
The following commit(s) were added to refs/heads/master by this push:
new 5ae99f758a [INLONG-657][Doc] Support extract node of Apache Hudi (#658)
5ae99f758a is described below
commit 5ae99f758a9624effa5a1f0ac7c5d31093d0b8c3
Author: ZuoFengZhang <[email protected]>
AuthorDate: Tue Dec 20 15:20:56 2022 +0800
[INLONG-657][Doc] Support extract node of Apache Hudi (#658)
Co-authored-by: averyzhang <[email protected]>
Co-authored-by: Charles Zhang <[email protected]>
---
docs/data_node/extract_node/hudi.md | 141 +++++++++++++++++++++
docs/data_node/extract_node/img/hudi.png | Bin 0 -> 115690 bytes
.../data_node/{load_node => extract_node}/hudi.md | 29 +++--
.../current/data_node/extract_node/img/hudi.png | Bin 0 -> 115442 bytes
.../current/data_node/load_node/hudi.md | 2 +-
5 files changed, 158 insertions(+), 14 deletions(-)
diff --git a/docs/data_node/extract_node/hudi.md b/docs/data_node/extract_node/hudi.md
new file mode 100644
index 0000000000..aa93e42798
--- /dev/null
+++ b/docs/data_node/extract_node/hudi.md
@@ -0,0 +1,141 @@
+---
+title: Hudi
+sidebar_position: 11
+---
+
+import {siteVariables} from '../../version';
+
+## Overview
+
+[Apache Hudi](https://hudi.apache.org/cn/docs/overview/) (pronounced "hoodie") is a next-generation streaming data lake platform.
+Apache Hudi brings core warehouse and database functionality directly into the data lake.
+Hudi provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping data in an open-source file format.
+
+## Supported Version
+
+| Extract Node      | Version                                                          |
+| ----------------- | ---------------------------------------------------------------- |
+| [Hudi](./hudi.md) | [Hudi](https://hudi.apache.org/cn/docs/quick-start-guide): 0.12+ |
+
+### Dependencies
+
+Add the `sort-connector-hudi` dependency to your project through `Maven` to build it yourself.
+Alternatively, you can use the prebuilt `jar` package provided by `InLong`
+([sort-connector-hudi](https://inlong.apache.org/download/)).
+
+### Maven dependency
+
+<pre><code parentName="pre">
+{`<dependency>
+ <groupId>org.apache.inlong</groupId>
+ <artifactId>sort-connector-hudi</artifactId>
+ <version>${siteVariables.inLongVersion}</version>
+</dependency>
+`}
+</code></pre>
+
+## How to create a Hudi Extract Node
+
+### Usage for SQL API
+
+The example below shows how to create a Hudi Extract Node with `Flink SQL CLI`:
+
+```sql
+CREATE TABLE `hudi_table_name` (
+ id STRING,
+ name STRING,
+ uv BIGINT,
+ pv BIGINT
+) WITH (
+ 'connector' = 'hudi-inlong',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_table_name',
+ 'uri' = 'thrift://127.0.0.1:8091',
+ 'hoodie.database.name' = 'hudi_db_name',
+ 'hoodie.table.name' = 'hudi_table_name',
+ 'read.streaming.check-interval'='1',
+ 'read.streaming.enabled'='true',
+ 'read.streaming.skip_compaction'='true',
+ 'read.start-commit'='20221220121000',
+  -- bucket index
+ 'hoodie.bucket.index.hash.field' = 'id',
+ -- compaction
+ 'compaction.tasks' = '10',
+ 'compaction.async.enabled' = 'true',
+ 'compaction.schedule.enabled' = 'true',
+  'compaction.max_memory' = '3096',
+  'compaction.trigger.strategy' = 'num_or_time',
+  'compaction.delta_commits' = '5',
+  -- archive and clean
+ 'hoodie.keep.min.commits' = '1440',
+ 'hoodie.keep.max.commits' = '2880',
+ 'clean.async.enabled' = 'true',
+  -- write
+ 'write.operation' = 'upsert',
+ 'write.bucket_assign.tasks' = '60',
+ 'write.tasks' = '60',
+ 'write.log_block.size' = '128',
+  -- index and table type
+ 'index.type' = 'BUCKET',
+ 'metadata.enabled' = 'false',
+ 'hoodie.bucket.index.num.buckets' = '20',
+ 'table.type' = 'MERGE_ON_READ',
+ 'clean.retain_commits' = '30',
+ 'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS'
+);
+```
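+
+Once the table above is created, a continuous query can consume its change stream. The sketch below is an illustration, not part of the original example: `print_sink` is a hypothetical table using Flink's built-in `print` connector for inspecting rows.
+
+```sql
+-- Hypothetical sink table for inspecting the streamed rows.
+CREATE TABLE `print_sink` (
+  id STRING,
+  name STRING,
+  uv BIGINT,
+  pv BIGINT
+) WITH (
+  'connector' = 'print'
+);
+
+-- Continuously read from the Hudi table and print each row.
+INSERT INTO `print_sink` SELECT * FROM `hudi_table_name`;
+```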
+
+### Usage for Dashboard
+
+#### Configuration
+
+When creating a data stream, select `Hudi` for the data stream direction, and click "Add" to configure it.
+
+
+
+| Config Item            | Prop in DDL statement            | Remark |
+| ---------------------- | -------------------------------- | ------ |
+| `DbName`               | `hoodie.database.name`           | The name of the database |
+| `TableName`            | `hoodie.table.name`              | The name of the table |
+| `EnableCreateResource` | -                                | If the table already exists and does not need to be modified, select [Do not create]; otherwise select [Create], and the system will create the resource automatically. |
+| `Catalog URI`          | `uri`                            | The server URI of the catalog |
+| `Warehouse`            | -                                | The HDFS location where the Hudi table is stored. In the SQL DDL, the `path` attribute is formed by joining the warehouse path with the database and table names. |
+| `StartCommit`          | `read.start-commit`              | Start commit instant for reading; the commit time format should be 'yyyyMMddHHmmss'. By default, streaming read starts from the latest instant. |
+| `SkipCompaction`       | `read.streaming.skip_compaction` | Whether to skip compaction instants for streaming read. This option can avoid reading duplicates in two cases: 1) you are sure the consumer reads faster than any compaction instant, usually with a delta-time compaction strategy that is long enough, e.g. one week; 2) changelog mode is enabled, where this option keeps data integrity. |
+
+### Usage for InLong Manager Client
+TODO
+
+## Hudi Extract Node Options
+
+| Option                         | Required | Default       | Type   | Description |
+| ------------------------------ | -------- | ------------- | ------ | ----------- |
+| connector                      | required | (none)        | String | Specify which connector to use; here it should be 'hudi-inlong'. |
+| uri                            | required | (none)        | String | Metastore URIs for Hive sync |
+| hoodie.database.name           | optional | (none)        | String | Database name used for incremental query. If different databases have tables with the same name during incremental query, set this to limit the table name to a specific database. |
+| hoodie.table.name              | optional | (none)        | String | Table name used for registering with Hive. Needs to be the same across runs. |
+| read.start-commit              | optional | latest commit | String | Start commit instant for reading; the commit time format should be 'yyyyMMddHHmmss'. By default, streaming read starts from the latest instant. |
+| read.streaming.skip_compaction | optional | false         | String | Whether to skip compaction instants for streaming read. This option can avoid reading duplicates in two cases: 1) you are sure the consumer reads faster than any compaction instant, usually with a delta-time compaction strategy that is long enough, e.g. one week; 2) changelog mode is enabled, where this option keeps data integrity. |
+| inlong.metric.labels           | optional | (none)        | String | InLong metric label; the value format is groupId=xxgroup&streamId=xxstream&nodeId=xxnode. |
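+
+For reference, a minimal table definition using only the essentials can look like the sketch below; the host names and HDFS path are illustrative, and `path` is included because the table location must be supplied as in the full example above:
+
+```sql
+CREATE TABLE `hudi_source_minimal` (
+  id STRING,
+  name STRING
+) WITH (
+  'connector' = 'hudi-inlong',
+  'uri' = 'thrift://127.0.0.1:8091',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_table_name',
+  'read.streaming.enabled' = 'true'
+);
+```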
+
+## Data Type Mapping
+
+| Hive type | Flink SQL type |
+| ------------- | -------------- |
+| char(p) | CHAR(p) |
+| varchar(p) | VARCHAR(p) |
+| string | STRING |
+| boolean | BOOLEAN |
+| tinyint | TINYINT |
+| smallint | SMALLINT |
+| int | INT |
+| bigint | BIGINT |
+| float | FLOAT |
+| double | DOUBLE |
+| decimal(p, s) | DECIMAL(p, s) |
+| date | DATE |
+| timestamp(9) | TIMESTAMP |
+| bytes | BINARY |
+| array | LIST |
+| map | MAP |
+| row | STRUCT |
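+
+As an illustration of the mapping above, the Flink DDL sketch below declares columns covering several of these types; the table and column names are arbitrary examples, and the connection options follow the earlier sample:
+
+```sql
+CREATE TABLE `hudi_typed_example` (
+  c_char   CHAR(4),                 -- maps to Hive char(4)
+  c_string STRING,                  -- maps to Hive string
+  c_int    INT,                     -- maps to Hive int
+  c_dec    DECIMAL(10, 2),          -- maps to Hive decimal(10, 2)
+  c_ts     TIMESTAMP(3),            -- maps to Hive timestamp
+  c_arr    ARRAY<STRING>,           -- maps to Hive array
+  c_map    MAP<STRING, BIGINT>,     -- maps to Hive map
+  c_row    ROW<x INT, y DOUBLE>     -- maps to Hive struct
+) WITH (
+  'connector' = 'hudi-inlong',
+  'uri' = 'thrift://127.0.0.1:8091',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_typed_example'
+);
+```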
diff --git a/docs/data_node/extract_node/img/hudi.png b/docs/data_node/extract_node/img/hudi.png
new file mode 100644
index 0000000000..89f9df1c61
Binary files /dev/null and b/docs/data_node/extract_node/img/hudi.png differ
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
similarity index 76%
copy from i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
copy to i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
index 1fe1b91ab9..463648c975 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
@@ -1,6 +1,6 @@
---
title: Hudi
-sidebar_position: 18
+sidebar_position: 12
---
import {siteVariables} from '../../version';
@@ -33,7 +33,7 @@ Hudi provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion
`}
</code></pre>
-## How to configure a Hudi data load node
+## How to configure a Hudi data extract node
### Usage for SQL API
@@ -51,7 +51,11 @@ CREATE TABLE `hudi_table_name` (
'uri' = 'thrift://127.0.0.1:8091',
'hoodie.database.name' = 'hudi_db_name',
'hoodie.table.name' = 'hudi_table_name',
- 'hoodie.datasource.write.recordkey.field' = 'id',
+ 'read.streaming.check-interval'='1',
+ 'read.streaming.enabled'='true',
+ 'read.streaming.skip_compaction'='true',
+ 'read.start-commit'='20221220121000',
+ --
'hoodie.bucket.index.hash.field' = 'id',
-- compaction
'compaction.tasks' = '10',
@@ -80,11 +84,11 @@ CREATE TABLE `hudi_table_name` (
);
```
-### Usage for InLong Dashboard
+### Usage for Dashboard
#### Configuration
-When creating a data stream, select 'Hive' as the data sink, then click 'Add' to configure the Hive-related information.
+When creating a data stream, select 'Hudi' as the data sink, then click 'Add' to configure the Hudi-related information.

@@ -95,25 +99,24 @@ CREATE TABLE `hudi_table_name` (
| `EnableCreateResource` | - | If the table already exists and does not need to be modified, select [Do not create]; otherwise select [Create], and the system will create the resource automatically. |
| `Catalog URI` | `uri` | The metadata service address |
| `Warehouse` | - | The HDFS location where the Hudi table is stored. In the SQL DDL, the `path` attribute joins the warehouse path with the database and table names. |
-| `Properties` | - | DDL properties of the Hudi table need the prefix 'ddl.' |
-| `Advanced options` > `Data consistency` | - | Consistency semantics of the Flink engine: `EXACTLY_ONCE` or `AT_LEAST_ONCE` |
-| `Partition field` | `hoodie.datasource.write.partitionpath.field` | Partition field |
-| `Primary key field` | `hoodie.datasource.write.recordkey.field` | Primary key field |
+| `SkipCompaction` | `read.streaming.skip_compaction` | Whether to skip compaction commits for streaming read. Skipping compaction serves two purposes: 1) avoiding duplicate consumption under upsert semantics (compaction instants contain duplicate data; if not skipped, there is a small chance of duplicate consumption); 2) keeping semantic correctness in changelog mode. Since 0.11, both issues have been fixed by keeping the compaction instant time. |
+| `StartCommit` | `read.start-commit` | Start commit, in `yyyyMMddHHmmss` format |
### Usage for InLong Manager Client
TODO: will be supported in a future version
-## Hudi Load Node Options
+## Hudi Extract Node Options
| Option | Required | Type | Description |
| ------------------------------------------- | -------- | ------ | ----------- |
| connector | required | String | Specify which connector to use; here it should be 'hudi-inlong'. |
-| uri | required | String | Metastore URIs for Hive sync |
+| uri | optional | String | Metastore URIs for Hive sync |
+| path | required | String | The file directory where the Hudi table is stored |
| hoodie.database.name | optional | String | Database name used for incremental query. If different databases have tables with the same name during incremental query, set this to limit the table name to a specific database. |
| hoodie.table.name | optional | String | Table name used for registering with Hive. Needs to be the same across runs. |
-| hoodie.datasource.write.recordkey.field | required | String | Record key field. Used as the `recordKey` component of `HoodieKey`. The actual value is obtained by calling .toString() on the field value. Nested fields can be specified with dot notation, e.g. `a.b.c`. |
-| hoodie.datasource.write.partitionpath.field | optional | String | Partition path field. Used in the `partitionPath` component of `HoodieKey`. The actual value is obtained by calling .toString(). |
+| `read.start-commit` | optional | String | Start commit in `yyyyMMddHHmmss` format (inclusive) |
+| `read.streaming.skip_compaction` | optional | String | Whether to skip compaction commits for streaming read (not skipped by default). Skipping compaction serves two purposes: 1) avoiding duplicate consumption under upsert semantics (compaction instants contain duplicate data; if not skipped, there is a small chance of duplicate consumption); 2) keeping semantic correctness in changelog mode. Since 0.11, both issues have been fixed by keeping the compaction instant time. |
| inlong.metric.labels | optional | String | InLong metric label; the value format is groupId=xxgroup&streamId=xxstream&nodeId=xxnode. |
## Data Type Mapping
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png
new file mode 100644
index 0000000000..bf563acbf7
Binary files /dev/null and b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png differ
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
index 1fe1b91ab9..4f3683b30a 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
@@ -84,7 +84,7 @@ CREATE TABLE `hudi_table_name` (
#### Configuration
-When creating a data stream, select 'Hive' as the data sink, then click 'Add' to configure the Hive-related information.
+When creating a data stream, select 'Hudi' as the data sink, then click 'Add' to configure the Hudi-related information.
