kinghao007 opened a new issue, #5326: URL: https://github.com/apache/linkis/issues/5326
### Search before asking

- [x] I searched the [issues](https://github.com/apache/linkis/issues) and found no similar issues.

### Linkis Component

linkis-engineconn-plugins

### What happened

Linkis currently lacks native support for Apache Hudi, a streaming data lake platform that brings database and data warehouse capabilities to the data lake. Hudi has become a de facto standard for incremental data processing and change data capture (CDC) on streaming data lakes, adopted by Uber, AWS, ByteDance, and other companies handling PB-scale streaming data.

**Market Demand:**

- **Streaming Data Lake Leader**: Apache top-level project, originated at Uber for PB-scale streaming data
- **Enterprise Adoption**: Uber, AWS, ByteDance, and Robinhood use Hudi for real-time data lakes
- **AWS Integration**: Amazon EMR and AWS Glue natively support Hudi
- **Incremental Processing**: Best-in-class support for CDC and incremental data pipelines
- **ACID Guarantees**: Full ACID transactions with record-level updates and deletes

**Technical Advantages:**

- **Upsert and Delete**: Native support for record-level updates and deletes on data lakes
- **Incremental Queries**: Efficient incremental reads for ETL pipelines (see the read sketch at the end of this section)
- **Time Travel**: Query historical versions of data
- **Copy-on-Write (CoW)**: Optimized for read-heavy workloads
- **Merge-on-Read (MoR)**: Optimized for write-heavy streaming workloads
- **Indexing**: Built-in indexes (Bloom filters, HBase index) for fast upserts

**Strategic Value:**

Hudi is essential for modern streaming data architectures:

- **CDC Pipelines**: Capture database changes and sync them to the data lake in real time
- **Streaming ETL**: Process streaming data with exactly-once semantics
- **Late Data Handling**: Handle late-arriving data efficiently
- **Data Lake Compaction**: Automatic small-file compaction and cleaning
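For context on the incremental query and time travel capabilities listed above, the following is a minimal sketch of how they are exposed today through Hudi's Spark DataSource read options (the table path and commit instants are hypothetical); a dedicated Linkis engine plugin would put the same operations behind the unified SQL interface described in the next section.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a Hudi Spark bundle matching the cluster's Spark
// version is on the classpath and /data/hudi/events is an existing Hudi table
// (hypothetical path).
val spark = SparkSession.builder().appName("hudi-incremental-read-sketch").getOrCreate()

// Incremental query: return only records committed after the given instant.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20241220100000") // hypothetical commit instant
  .load("/data/hudi/events")

// Time travel: read the table as of an earlier instant.
val asOf = spark.read.format("hudi")
  .option("as.of.instant", "20241220100000") // hypothetical commit instant
  .load("/data/hudi/events")

incremental.show(false)
asOf.show(false)
```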
### What you expected to happen

Linkis should provide an Apache Hudi engine plugin with the following capabilities:

1. **Hudi Table Operations:**
   - CREATE/DROP Hudi tables (CoW and MoR types)
   - INSERT INTO for initial data loading
   - UPSERT operations for record-level updates
   - DELETE operations for record-level deletions
   - MERGE INTO for complex upsert logic
2. **Table Types Support:**
   - Copy-on-Write (CoW) tables for read-optimized workloads
   - Merge-on-Read (MoR) tables for write-optimized workloads
   - Automatic table type selection based on workload
3. **Incremental Processing:**
   - Incremental query support: read only the records changed since the last checkpoint
   - Change log capture for CDC pipelines
   - Time travel queries against historical commits
   - Snapshot queries for point-in-time reads
4. **Index Management:**
   - Bloom filter index for upsert performance
   - HBase index for global lookups
   - Simple index for small datasets
   - Custom index plugin support
5. **Compaction and Cleaning:**
   - Async compaction for MoR tables
   - Inline compaction strategies
   - Cleaning of old file versions
   - Savepoint management for rollback
6. **Integration with Streaming:** (see the streaming write sketch after the use case example below)
   - Flink integration for real-time ingestion
   - Spark Streaming integration
   - DeltaStreamer for Kafka/database CDC
   - Exactly-once semantics guarantee
7. **Integration with Linkis:**
   - Unified task submission interface
   - Resource management integration
   - Permission control integration
   - Metadata catalog integration

### How to reproduce

Current situation:

1. Users need to manually configure Spark/Flink with Hudi libraries (see the configuration sketch after the use case example below)
2. No dedicated Hudi table management interface in Linkis
3. Cannot leverage Linkis's unified task submission for Hudi operations
4. Limited support for Hudi-specific features (incremental queries, compaction)

Use case example:

```sql
-- Create Copy-on-Write table for read-heavy workload
CREATE TABLE hudi_catalog.db.users (
  user_id BIGINT,
  name STRING,
  email STRING,
  last_updated TIMESTAMP
) USING hudi
TBLPROPERTIES (
  'type' = 'cow',
  'primaryKey' = 'user_id',
  'preCombineField' = 'last_updated'
);

-- Create Merge-on-Read table for write-heavy streaming workload
CREATE TABLE hudi_catalog.db.events (
  event_id STRING,
  user_id BIGINT,
  event_type STRING,
  event_time TIMESTAMP,
  event_data STRING
) USING hudi
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'event_id',
  'preCombineField' = 'event_time'
);

-- Upsert data (insert or update based on primary key)
INSERT INTO hudi_catalog.db.users VALUES
  (1001, 'Alice', '[email protected]', TIMESTAMP '2024-12-20 10:00:00'),
  (1002, 'Bob', '[email protected]', TIMESTAMP '2024-12-20 11:00:00');

-- Update existing records
MERGE INTO hudi_catalog.db.users AS target
USING (
  SELECT 1001 AS user_id,
         '[email protected]' AS email,
         TIMESTAMP '2024-12-20 12:00:00' AS last_updated
) AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET
  email = source.email,
  last_updated = source.last_updated;

-- Delete records
DELETE FROM hudi_catalog.db.users WHERE user_id = 1002;

-- Incremental query: read only changes since last checkpoint
SELECT * FROM hudi_catalog.db.events
WHERE _hoodie_commit_time > '20241220100000';

-- Time travel query
SELECT * FROM hudi_catalog.db.users TIMESTAMP AS OF '2024-12-20 10:00:00';

-- Compaction (for MoR tables)
CALL run_compaction(table => 'hudi_catalog.db.events', op => 'schedule_and_execute');

-- Clean old file versions
CALL run_clean(table => 'hudi_catalog.db.events', retain_commits => 10);
```
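Regarding point 1 of the current situation: without a dedicated engine plugin, a user first has to wire up a Hudi-enabled Spark session by hand (and ship a Hudi bundle jar, e.g. via `--packages` or `spark.jars`) before any of the SQL above can run. A minimal sketch, assuming a Hudi Spark bundle compatible with the cluster's Spark 3.x version is available; the application name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Manual wiring that a dedicated Hudi engine plugin would take care of.
// The serializer, extension, and catalog settings below are the ones
// documented by Hudi for Spark SQL support.
val spark = SparkSession.builder()
  .appName("manual-hudi-spark-session") // hypothetical
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()

// Only with such a session can the use case SQL above be submitted through
// the generic Spark engine, e.g.:
// spark.sql("DELETE FROM hudi_catalog.db.users WHERE user_id = 1002")
```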
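And for item 6 of the expected capabilities (streaming integration), this is a minimal sketch of how a Merge-on-Read table like the `events` table above is typically fed today with Spark Structured Streaming and Hudi's documented write options; the source (a rate stream standing in for Kafka/CDC data), checkpoint location, and table path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hudi-streaming-ingest-sketch").getOrCreate()
import spark.implicits._

// Hypothetical streaming source; in practice this would be Kafka/CDC data.
val events = spark.readStream
  .format("rate")
  .load()
  .select(
    concat(lit("evt-"), $"value".cast("string")).as("event_id"),
    ($"value" % 100).as("user_id"),
    lit("click").as("event_type"),
    $"timestamp".as("event_time"),
    lit("{}").as("event_data"))

// Continuous upserts into a Merge-on-Read table; the streaming checkpoint is
// what backs the exactly-once delivery mentioned above. Paths are hypothetical.
events.writeStream
  .format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "event_time")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("checkpointLocation", "/checkpoints/hudi/events")
  .outputMode("append")
  .start("/data/hudi/events")
  .awaitTermination()
```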
