[GitHub] [inlong-website] dockerzhang commented on a diff in pull request #650: [INLONG-649][Doc] Add the usage document for Apache Hudi

GitBox Sun, 18 Dec 2022 22:50:04 -0800


dockerzhang commented on code in PR #650:
URL: https://github.com/apache/inlong-website/pull/650#discussion_r1051860728



##########
i18n/zh-CN/docusaurus-plugin-content-docs-community/current/how-to-report-issues.md:
##########
@@ -32,16 +32,16 @@ The Apache InLong project uses GitHub Issues to track all issues. These include
 
 For the summary, provide a descriptive title, e.g. `[Bug][Dataproxy] Repeated registration jmx metric bean` rather than `Dataproxy registration error`.
 
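-|    Component    | Description |
-|:---------------:|:----------------------------------------------------------------------|
-|      Agent      | Data collection Agent that reads regular logs from specified directories or files and reports them record by record. DB collection and other capabilities will be added later. |
-|    DataProxy    | A Proxy component based on Flume-ng that supports blocking on send and persisting to disk for resend, and can forward received data to different MQs (message queues). |
-|     TubeMQ      | A message queue service developed in-house by Tencent, focused on high-performance storage and transmission of massive data in big-data scenarios, with core strengths in large-scale production use and low cost. |
-|      Sort       | Performs ETL on data consumed from different MQs, then aggregates it and writes it to storage systems such as Hive, ClickHouse, Hbase and Iceberg. |
-|     Manage      | Provides complete data service management and control, including metadata, task flows, permissions, OpenAPI, etc. |
-|    Dashboard    | Front-end pages for managing data ingestion, simplifying use of the whole InLong management platform. |
-|      Audit      | Performs real-time audit and reconciliation of inbound and outbound traffic for the Agent, DataProxy and Sort modules of the InLong system. |
-|       SDK       | Includes DataProxy SDK, Sort SDK, etc. |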
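+|    Component     | Description |
+|:----------------:|:---------------------------------------------------------------------:|
+|   InLong Agent   | Data collection Agent that reads regular logs from specified directories or files and reports them record by record. DB collection and other capabilities will be added later. |
+| InLong Dataproxy | A Proxy component based on Flume-ng that supports blocking on send and persisting to disk for resend, and can forward received data to different MQs (message queues). |
+|  InLong TubeMQ   | A message queue service developed in-house by Tencent, focused on high-performance storage and transmission of massive data in big-data scenarios, with core strengths in large-scale production use and low cost. |
+|   InLong Sort    | Performs ETL on data consumed from different MQs, then aggregates it and writes it to storage systems such as Hive, ClickHouse, Hbase, Iceberg and Hudi. |
+|  InLong Manager  | Provides complete data service management and control, including metadata, task flows, permissions, OpenAPI, etc. |
+| InLong Dashboard | Front-end pages for managing data ingestion, simplifying use of the whole InLong management platform. |
+|   InLong Audit   | Performs real-time audit and reconciliation of inbound and outbound traffic for the Agent, DataProxy and Sort modules of the InLong system. |
+|    InLong SDK    | Includes DataProxy SDK, Sort SDK, etc. |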

Review Comment:
   Do we need to change the file?



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/hudi.md:
##########
@@ -0,0 +1,204 @@
+---
+title: Hudi
+sidebar_position: 18
+---
+
+import {siteVariables} from '../../version';
+
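+## Overview
+[Apache Hudi](https://hudi.apache.org/cn/docs/overview/) (pronounced "hoodie") is a next-generation streaming data lake platform.
+Apache Hudi brings core warehouse and database functionality directly to the data lake.
+Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping data in open source file formats.
+
+## Supported Versions
+
+| Load Node         | Version |
+|-------------------|---------|
+| [Hudi](./hudi.md) | [Hudi](https://hudi.apache.org/cn/docs/quick-start-guide): 0.12+ |
+
+### Dependencies
+
+Use `Maven` to add `sort-connector-hudi` when building your own project.
+Alternatively, you can use the `jar` package provided by `INLONG` directly ([sort-connector-hudi](https://inlong.apache.org/download)).
+
+### Maven Dependency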
+
+<pre><code parentName="pre">
+{`<dependency>
+    <groupId>org.apache.inlong</groupId>
+    <artifactId>sort-connector-hudi</artifactId>
+    <version>${siteVariables.inLongVersion}</version>
+</dependency>
+`}
+</code></pre>
+
+## How to Configure a Hudi Load Node
+
+### Using the SQL API
+
+Using the `Flink SQL Cli`:
+
+```sql
+CREATE TABLE `hudi_table_name` (
+  id STRING,
+  name STRING,
+  uv BIGINT,
+  pv BIGINT
+) WITH (
+    'connector' = 'hudi-inlong',
+    'path' = 'hdfs://127.0.0.1:90001/data/warehouse/hudi_db_name.db/hudi_table_name',
+    'uri' = 'thrift://127.0.0.1:8091',
+    'hoodie.database.name' = 'hudi_db_name',
+    'hoodie.table.name' = 'hudi_table_name',
+    'hoodie.datasource.write.recordkey.field' = 'id',
+    'hoodie.bucket.index.hash.field' = 'id',
+    -- compaction
+    'compaction.tasks' = '10',
+    'compaction.async.enabled' = 'true',
+    'compaction.schedule.enabled' = 'true',
+    'compaction.max_memory' = '3096',
+    'compaction.trigger.strategy' = 'num_or_time',
+    'compaction.delta_commits' = '5',
+    --
+    'hoodie.keep.min.commits' = '1440',
+    'hoodie.keep.max.commits' = '2880',
+    'clean.async.enabled' = 'true',
+    --
+    'write.operation' = 'upsert',
+    'write.bucket_assign.tasks' = '60',
+    'write.tasks' = '60',
+    'write.log_block.size' = '128',
+    --
+    'index.type' = 'BUCKET',
+    'metadata.enabled' = 'false',
+    'hoodie.bucket.index.num.buckets' = '20',
+    'table.type' = 'MERGE_ON_READ',
+    'clean.retain_commits' = '30',
+    'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS'
+);
+```
+
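+### Using InLong Dashboard
+
+#### Configuration
+When creating a data stream, select 'Hudi' as the data sink, then click 'Add' to configure the Hudi-related information.
+
+![Hudi Configuration](img/hudi_sink_conf.png)
+
+| Config Item | Property in SQL DDL | Remarks |
+| --- | --- | --- |
+|`DB name`| `hoodie.database.name` | Database name |
+|`Table name`|`hoodie.table.name`| Hudi table name |
+|`Create resource`| - | If the database and table already exist and need no changes, select [Do not create];<br/>otherwise select [Create] and the system creates the resource automatically. |
+|`Catalog URI`|`uri`| Metastore service address |
+|`Warehouse path`| - | Location where the Hudi table is stored in HDFS<br/>In the SQL DDL, the path property is the `Warehouse path` concatenated with the database and table names |
+|`Properties`| - | DDL properties of the Hudi table must carry the prefix 'ddl.' |
+|`Advanced options`>`Data consistency` | - | Consistency semantics of the Flink compute engine: `EXACTLY_ONCE` or `AT_LEAST_ONCE` |
+|`Partition field` | `hoodie.datasource.write.partitionpath.field` | Partition field |
+|`Primary key field` | `hoodie.datasource.write.recordkey.field` | Primary key field |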
+
+### Using InLong Manager Client
+
+TODO: to be supported in a future release
+
+## Hudi Load Node Options
+
+
+
+
+
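+| Option                                      | Required | Type   | Description |
+| ------------------------------------------- | -------- | ------ | ------------------------------------------------------------ |
+| connector                                   | required | String | Specifies the connector to use; here it should be 'hudi-inlong'. |
+| uri                                         | required | String | Metastore URIs used for Hive sync |
+| hoodie.database.name                        | optional | String | Database name used for incremental queries. If different databases contain tables with the same name during incremental queries, this can be set to restrict the table name to a specific database |
+| hoodie.table.name                           | optional | String | Table name used for registering with Hive. Needs to stay the same across runs. |
+| hoodie.datasource.write.recordkey.field     | required | String | Record key field. Its value is used as the `recordKey` component of `HoodieKey`. The actual value is obtained by calling .toString() on the field value. Nested fields can be specified using dot notation, e.g. `a.b.c` |
+| hoodie.datasource.write.partitionpath.field | optional | String | Partition path field. Its value is used in the partitionPath component of HoodieKey. The actual value is obtained by calling .toString() |
+| inlong.metric.labels                        | optional | String | Labels for InLong metrics; the value has the format groupId=xxgroup&streamId=xxstream&nodeId=xxnode. |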
+
+
+
+## Data Type Mapping
+
+<div class="wy-table-responsive">

Review Comment:
   please use markdown table
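
   A minimal sketch of what that could look like, assuming the HTML block wraps a two-column type mapping. The column headers and both rows are placeholders for illustration (the types are borrowed from the DDL example in this file), not the PR's actual mapping:

   ```markdown
   | Hudi type | Flink SQL type |
   |-----------|----------------|
   | string    | STRING         |
   | bigint    | BIGINT         |
   ```

   A pipe table would also match the other tables already used in this file, such as the version and options tables.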



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
