This is an automated email from the ASF dual-hosted git repository.
dockerzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/inlong-website.git
The following commit(s) were added to refs/heads/master by this push:
new 5ae99f758a [INLONG-657][Doc] Support extract node of Apache Hudi (#658)
5ae99f758a is described below
commit 5ae99f758a9624effa5a1f0ac7c5d31093d0b8c3
Author: ZuoFengZhang <[email protected]>
AuthorDate: Tue Dec 20 15:20:56 2022 +0800
[INLONG-657][Doc] Support extract node of Apache Hudi (#658)
Co-authored-by: averyzhang <[email protected]>
Co-authored-by: Charles Zhang <[email protected]>
---
docs/data_node/extract_node/hudi.md | 141 +++++++++++++++++++++
docs/data_node/extract_node/img/hudi.png | Bin 0 -> 115690 bytes
.../data_node/{load_node => extract_node}/hudi.md | 29 +++--
.../current/data_node/extract_node/img/hudi.png | Bin 0 -> 115442 bytes
.../current/data_node/load_node/hudi.md | 2 +-
5 files changed, 158 insertions(+), 14 deletions(-)
diff --git a/docs/data_node/extract_node/hudi.md b/docs/data_node/extract_node/hudi.md
new file mode 100644
index 0000000000..aa93e42798
--- /dev/null
+++ b/docs/data_node/extract_node/hudi.md
@@ -0,0 +1,141 @@
+---
+title: Hudi
+sidebar_position: 11
+---
+
+import {siteVariables} from '../../version';
+
+## Overview
+
+[Apache Hudi](https://hudi.apache.org/cn/docs/overview/) (pronounced "hoodie") is a next-generation streaming data lake platform.
+Apache Hudi brings core warehouse and database functionality directly into the data lake.
+Hudi provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping data in an open-source file format.
+
+## Supported Version
+
+| Extract Node      | Version                                                          |
+| ----------------- | ---------------------------------------------------------------- |
+| [Hudi](./hudi.md) | [Hudi](https://hudi.apache.org/cn/docs/quick-start-guide): 0.12+ |
+
+### Dependencies
+
+Add the `sort-connector-hudi` dependency to your project through `Maven` to build it yourself.
+Alternatively, you can use the prebuilt `jar` package provided by `InLong`
+([sort-connector-hudi](https://inlong.apache.org/download/)).
+
+### Maven dependency
+
+<pre><code parentName="pre">
+{`<dependency>
+ <groupId>org.apache.inlong</groupId>
+ <artifactId>sort-connector-hudi</artifactId>
+ <version>${siteVariables.inLongVersion}</version>
+</dependency>
+`}
+</code></pre>
+
+## How to create a Hudi Extract Node
+
+### Usage for SQL API
+
+The example below shows how to create a Hudi Extract Node with `Flink SQL CLI`:
+
+```sql
+CREATE TABLE `hudi_table_name` (
+ id STRING,
+ name STRING,
+ uv BIGINT,
+ pv BIGINT
+) WITH (
+ 'connector' = 'hudi-inlong',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_table_name',
+ 'uri' = 'thrift://127.0.0.1:8091',
+ 'hoodie.database.name' = 'hudi_db_name',
+ 'hoodie.table.name' = 'hudi_table_name',
+ 'read.streaming.check-interval'='1',
+ 'read.streaming.enabled'='true',
+ 'read.streaming.skip_compaction'='true',
+ 'read.start-commit'='20221220121000',
+  -- bucket index
+ 'hoodie.bucket.index.hash.field' = 'id',
+ -- compaction
+ 'compaction.tasks' = '10',
+ 'compaction.async.enabled' = 'true',
+ 'compaction.schedule.enabled' = 'true',
+  'compaction.max_memory' = '3096',
+  'compaction.trigger.strategy' = 'num_or_time',
+  'compaction.delta_commits' = '5',
+  -- archive and clean
+ 'hoodie.keep.min.commits' = '1440',
+ 'hoodie.keep.max.commits' = '2880',
+ 'clean.async.enabled' = 'true',
+  -- write
+ 'write.operation' = 'upsert',
+ 'write.bucket_assign.tasks' = '60',
+ 'write.tasks' = '60',
+ 'write.log_block.size' = '128',
+  -- index and table type
+ 'index.type' = 'BUCKET',
+ 'metadata.enabled' = 'false',
+ 'hoodie.bucket.index.num.buckets' = '20',
+ 'table.type' = 'MERGE_ON_READ',
+ 'clean.retain_commits' = '30',
+ 'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS'
+);
+```
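+
+Once the table above is created, a continuous query can consume its change stream. The sketch below is an illustration, not part of the original example: `print_sink` is a hypothetical table using Flink's built-in `print` connector for inspecting rows.
+
+```sql
+-- Hypothetical sink table for inspecting the streamed rows.
+CREATE TABLE `print_sink` (
+  id STRING,
+  name STRING,
+  uv BIGINT,
+  pv BIGINT
+) WITH (
+  'connector' = 'print'
+);
+
+-- Continuously read from the Hudi table and print each row.
+INSERT INTO `print_sink` SELECT * FROM `hudi_table_name`;
+```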
+
+### Usage for Dashboard
+
+#### Configuration
+
+When creating a data stream, select `Hudi` for the data stream direction, and click "Add" to configure it.
+
+
+
+| Config Item            | Prop in DDL statement            | Remark |
+| ---------------------- | -------------------------------- | ------ |
+| `DbName`               | `hoodie.database.name`           | The name of the database |
+| `TableName`            | `hoodie.table.name`              | The name of the table |
+| `EnableCreateResource` | -                                | If the table already exists and does not need to be modified, select [Do not create]; otherwise select [Create], and the system will create the resource automatically. |
+| `Catalog URI`          | `uri`                            | The server URI of the catalog |
+| `Warehouse`            | -                                | The HDFS location where the Hudi table is stored. In the SQL DDL, the `path` attribute is formed by joining the warehouse path with the database and table names. |
+| `StartCommit`          | `read.start-commit`              | Start commit instant for reading; the commit time format should be 'yyyyMMddHHmmss'. By default, streaming read starts from the latest instant. |
+| `SkipCompaction`       | `read.streaming.skip_compaction` | Whether to skip compaction instants for streaming read. This option can avoid reading duplicates in two cases: 1) you are sure the consumer reads faster than any compaction instant, usually with a delta-time compaction strategy that is long enough, e.g. one week; 2) changelog mode is enabled, where this option keeps data integrity. |
+
+### Usage for InLong Manager Client
+TODO
+
+## Hudi Extract Node Options
+
+| Option                         | Required | Default       | Type   | Description |
+| ------------------------------ | -------- | ------------- | ------ | ----------- |
+| connector                      | required | (none)        | String | Specify which connector to use; here it should be 'hudi-inlong'. |
+| uri                            | required | (none)        | String | Metastore URIs for Hive sync |
+| hoodie.database.name           | optional | (none)        | String | Database name used for incremental query. If different databases have tables with the same name during incremental query, set this to limit the table name to a specific database. |
+| hoodie.table.name              | optional | (none)        | String | Table name used for registering with Hive. Needs to be the same across runs. |
+| read.start-commit              | optional | latest commit | String | Start commit instant for reading; the commit time format should be 'yyyyMMddHHmmss'. By default, streaming read starts from the latest instant. |
+| read.streaming.skip_compaction | optional | false         | String | Whether to skip compaction instants for streaming read. This option can avoid reading duplicates in two cases: 1) you are sure the consumer reads faster than any compaction instant, usually with a delta-time compaction strategy that is long enough, e.g. one week; 2) changelog mode is enabled, where this option keeps data integrity. |
+| inlong.metric.labels           | optional | (none)        | String | InLong metric label; the value format is groupId=xxgroup&streamId=xxstream&nodeId=xxnode. |
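+
+For reference, a minimal table definition using only the essentials can look like the sketch below; the host names and HDFS path are illustrative, and `path` is included because the table location must be supplied as in the full example above:
+
+```sql
+CREATE TABLE `hudi_source_minimal` (
+  id STRING,
+  name STRING
+) WITH (
+  'connector' = 'hudi-inlong',
+  'uri' = 'thrift://127.0.0.1:8091',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_table_name',
+  'read.streaming.enabled' = 'true'
+);
+```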
+
+## Data Type Mapping
+
+| Hive type | Flink SQL type |
+| ------------- | -------------- |
+| char(p) | CHAR(p) |
+| varchar(p) | VARCHAR(p) |
+| string | STRING |
+| boolean | BOOLEAN |
+| tinyint | TINYINT |
+| smallint | SMALLINT |
+| int | INT |
+| bigint | BIGINT |
+| float | FLOAT |
+| double | DOUBLE |
+| decimal(p, s) | DECIMAL(p, s) |
+| date | DATE |
+| timestamp(9) | TIMESTAMP |
+| bytes | BINARY |
+| array | LIST |
+| map | MAP |
+| row | STRUCT |
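+
+As an illustration of the mapping above, the Flink DDL sketch below declares columns covering several of these types; the table and column names are arbitrary examples, and the connection options follow the earlier sample:
+
+```sql
+CREATE TABLE `hudi_typed_example` (
+  c_char   CHAR(4),                 -- maps to Hive char(4)
+  c_string STRING,                  -- maps to Hive string
+  c_int    INT,                     -- maps to Hive int
+  c_dec    DECIMAL(10, 2),          -- maps to Hive decimal(10, 2)
+  c_ts     TIMESTAMP(3),            -- maps to Hive timestamp
+  c_arr    ARRAY<STRING>,           -- maps to Hive array
+  c_map    MAP<STRING, BIGINT>,     -- maps to Hive map
+  c_row    ROW<x INT, y DOUBLE>     -- maps to Hive struct
+) WITH (
+  'connector' = 'hudi-inlong',
+  'uri' = 'thrift://127.0.0.1:8091',
+  'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_typed_example'
+);
+```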
diff --git a/docs/data_node/extract_node/img/hudi.png b/docs/data_node/extract_node/img/hudi.png
new file mode 100644
index 0000000000..89f9df1c61
Binary files /dev/null and b/docs/data_node/extract_node/img/hudi.png differ
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
similarity index 76%
copy from i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
copy to i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
index 1fe1b91ab9..463648c975 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/hudi.md
@@ -1,6 +1,6 @@
---
title: Hudi
-sidebar_position: 18
+sidebar_position: 12
---
import {siteVariables} from '../../version';
@@ -33,7 +33,7 @@ Hudi provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion
`}
</code></pre>
-## How to configure a Hudi data load node
+## How to configure a Hudi data extract node
### Usage for SQL API
@@ -51,7 +51,11 @@ CREATE TABLE `hudi_table_name` (
'uri' = 'thrift://127.0.0.1:8091',
'hoodie.database.name' = 'hudi_db_name',
'hoodie.table.name' = 'hudi_table_name',
- 'hoodie.datasource.write.recordkey.field' = 'id',
+ 'read.streaming.check-interval'='1',
+ 'read.streaming.enabled'='true',
+ 'read.streaming.skip_compaction'='true',
+ 'read.start-commit'='20221220121000',
+ --
'hoodie.bucket.index.hash.field' = 'id',
-- compaction
'compaction.tasks' = '10',
@@ -80,11 +84,11 @@ CREATE TABLE `hudi_table_name` (
);
```
-### Usage for InLong Dashboard
+### Usage for Dashboard
#### Configuration
-When creating a data stream, select 'Hive' as the data sink, then click 'Add' to configure the Hive-related information.
+When creating a data stream, select 'Hudi' as the data sink, then click 'Add' to configure the Hudi-related information.

@@ -95,25 +99,24 @@ CREATE TABLE `hudi_table_name` (
| `EnableCreateResource` | - | If the table already exists and does not need to be modified, select [Do not create]; otherwise select [Create], and the system will create the resource automatically. |
| `Catalog URI` | `uri` | The metadata service address |
| `Warehouse` | - | The HDFS location where the Hudi table is stored. In the SQL DDL, the `path` attribute joins the warehouse path with the database and table names. |
-| `Properties` | - | DDL properties of the Hudi table need the prefix 'ddl.' |
-| `Advanced options` > `Data consistency` | - | Consistency semantics of the Flink engine: `EXACTLY_ONCE` or `AT_LEAST_ONCE` |
-| `Partition field` | `hoodie.datasource.write.partitionpath.field` | Partition field |
-| `Primary key field` | `hoodie.datasource.write.recordkey.field` | Primary key field |
+| `SkipCompaction` | `read.streaming.skip_compaction` | Whether to skip compaction commits for streaming read. Skipping compaction serves two purposes: 1) avoiding duplicate consumption under upsert semantics (compaction instants contain duplicate data; if not skipped, there is a small chance of duplicate consumption); 2) keeping semantic correctness in changelog mode. Since 0.11, both issues have been fixed by keeping the compaction instant time. |
+| `StartCommit` | `read.start-commit` | Start commit, in `yyyyMMddHHmmss` format |
### Usage for InLong Manager Client
TODO: will be supported in a future version
-## Hudi Load Node Options
+## Hudi Extract Node Options
| Option | Required | Type | Description |
| ------------------------------------------- | -------- | ------ | ----------- |
| connector | required | String | Specify which connector to use; here it should be 'hudi-inlong'. |
-| uri | required | String | Metastore URIs for Hive sync |
+| uri | optional | String | Metastore URIs for Hive sync |
+| path | required | String | The file directory where the Hudi table is stored |
| hoodie.database.name | optional | String | Database name used for incremental query. If different databases have tables with the same name during incremental query, set this to limit the table name to a specific database. |
| hoodie.table.name | optional | String | Table name used for registering with Hive. Needs to be the same across runs. |
-| hoodie.datasource.write.recordkey.field | required | String | Record key field. Used as the `recordKey` component of `HoodieKey`. The actual value is obtained by calling .toString() on the field value. Nested fields can be specified with dot notation, e.g. `a.b.c`. |
-| hoodie.datasource.write.partitionpath.field | optional | String | Partition path field. Used in the `partitionPath` component of `HoodieKey`. The actual value is obtained by calling .toString(). |
+| `read.start-commit` | optional | String | Start commit in `yyyyMMddHHmmss` format (inclusive) |
+| `read.streaming.skip_compaction` | optional | String | Whether to skip compaction commits for streaming read (not skipped by default). Skipping compaction serves two purposes: 1) avoiding duplicate consumption under upsert semantics (compaction instants contain duplicate data; if not skipped, there is a small chance of duplicate consumption); 2) keeping semantic correctness in changelog mode. Since 0.11, both issues have been fixed by keeping the compaction instant time. |
| inlong.metric.labels | optional | String | InLong metric label; the value format is groupId=xxgroup&streamId=xxstream&nodeId=xxnode. |
## Data Type Mapping
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png
new file mode 100644
index 0000000000..bf563acbf7
Binary files /dev/null and b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/extract_node/img/hudi.png differ
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
index 1fe1b91ab9..4f3683b30a 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md
@@ -84,7 +84,7 @@ CREATE TABLE `hudi_table_name` (
#### Configuration
-When creating a data stream, select 'Hive' as the data sink, then click 'Add' to configure the Hive-related information.
+When creating a data stream, select 'Hudi' as the data sink, then click 'Add' to configure the Hudi-related information.
