[I] [Feature][connector-hive] hive connector support overwrite mode [seatunnel]

via GitHub Tue, 15 Oct 2024 00:52:09 -0700


Adamyuanyuan opened a new issue, #7843:
URL: https://github.com/apache/seatunnel/issues/7843


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Add a flag that decides whether the Hive sink uses overwrite mode when 
inserting data. If set to true, the existing data of a non-partitioned table is 
deleted before new data is inserted. For a partitioned table, only the data in 
the relevant partitions is deleted before new data is inserted.
   
   The Hive sink currently always appends. In practice, some jobs need to 
overwrite existing data, similar to `insert overwrite`, or to delete it before 
insertion. There are several ways to implement this:
   
   1. **Using Scheduling Workflows**: As a temporary workaround, configure a 
workflow in the scheduler that first deletes the corresponding table and then 
performs the insertion.
   2. **Upper-Layer Data Integration Products**: A pipeline or similar 
mechanism deletes the data before the insertion.
   3. **Native Support for "Overwrite" Mode in SeaTunnel**: Implementing this 
feature directly in the SeaTunnel core is currently the most convenient option. 
Using SeaTunnel's two-phase commit logic, data is first written to a temporary 
directory; at commit time the old target directory is deleted (using 
`deleteFile(directory)`) and the temporary directory is renamed into place. 
This gives better data consistency, because the window between deleting and 
renaming the directory is a matter of milliseconds. It reuses existing utility 
classes, so the code change is small, and it performs significantly better than 
the upper-layer approaches.
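
The commit sequence in approach 3 can be illustrated with a minimal sketch. It is written in Python against the local filesystem purely for illustration; the actual connector is Java and works on the Hive table's storage, and the names `commit_overwrite` and `demo` are hypothetical, not SeaTunnel APIs.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def commit_overwrite(tmp_dir: Path, target_dir: Path) -> None:
    """Replace target_dir with the fully written tmp_dir (hypothetical helper)."""
    # Phase 1 has already happened: every new file was written into tmp_dir.
    # Phase 2: delete the old data, then move the new data into place.
    if target_dir.exists():
        shutil.rmtree(target_dir)  # plays the role of deleteFile(directory)
    target_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir.rename(target_dir)  # a rename within one filesystem, no data copy


def demo() -> List[str]:
    """Write old data, stage new data in a temp dir, then commit the overwrite."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        target = base / "warehouse" / "test_overwrite_1"
        target.mkdir(parents=True)
        (target / "old.txt").write_text("old rows")

        tmp = base / "tmp" / "job_1"
        tmp.mkdir(parents=True)
        (tmp / "new.txt").write_text("new rows")

        commit_overwrite(tmp, target)
        return sorted(p.name for p in target.iterdir())
```

Because the slow part (writing the data) finishes before anything is deleted, readers only see a missing directory for the short interval between the delete and the rename.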
   
   The implementation is modeled on the logic of Flink's `overwrite` operator.
   
   Only one new parameter, `overwrite` (default `false`), is needed on the Hive 
sink:
   
   ```
   sink {
     Hive {
       table_name = "default.test_overwrite_1"
       metastore_uri = "thrift://hadoop-master1.orb.local:9083"
       overwrite = true
       source_table_name = "source_table"
     }
   }
   ```
   
   ### Expected Logic
   
   | Operation Type                   | Logic Description |
   |----------------------------------|-------------------|
   | Hive Table Overwrite             | Delete old table data, write new data |
   | Hive Partitioned Table Overwrite | Delete only the relevant partition data to be overwritten, write corresponding partition data |
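
The two rows of the table can be read as a rule for choosing which directories to clear before the rename. The sketch below is a minimal Python illustration, assuming the usual Hive `key=value` partition directory layout; `partition_path` and `dirs_to_overwrite` are hypothetical names and not part of the connector.

```python
from typing import Dict, Iterable, List


def partition_path(table_location: str, partition_keys: List[str],
                   row: Dict[str, str]) -> str:
    """Build a Hive-style path: <table>/<key1>=<val1>/<key2>=<val2>."""
    parts = [f"{key}={row[key]}" for key in partition_keys]
    return "/".join([table_location.rstrip("/")] + parts)


def dirs_to_overwrite(table_location: str, partition_keys: List[str],
                      rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return the directories whose old data should be deleted."""
    if not partition_keys:
        # Non-partitioned table: the whole table directory is replaced.
        return [table_location.rstrip("/")]
    # Partitioned table: only partitions that receive new rows are replaced.
    touched = {partition_path(table_location, partition_keys, r) for r in rows}
    return sorted(touched)
```

For example, `dirs_to_overwrite("/warehouse/t", ["dt"], rows)` with rows for two distinct `dt` values returns two partition directories, and any other partition of the table is left untouched.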
   
   
   -----------
   
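   Currently, Hive inserts run in append mode, but in practice some use cases need to overwrite existing data, similar to `insert overwrite`, or to delete data before inserting. There are several ways to implement this, for example:
   1. Using a scheduling workflow: a temporary solution in which a data-processing workflow first deletes the corresponding table and then performs the insertion;
   2. The upper-layer data integration product deletes the data before insertion, through a pipeline or a similar mechanism, and then writes the new data;
   3. SeaTunnel natively supports an "overwrite" mode.
   
   On comparison, implementing this feature directly in the SeaTunnel core is the most convenient. Using SeaTunnel's two-phase commit logic, data is first written to a temporary directory, then the delete is performed (`deleteFile(directory)`), then the rename. Data consistency is better, the time between deleting the directory and renaming it is on the order of milliseconds, existing utility classes are reused so the code change is small, and the result is far better than the upper-layer approaches.
   The implementation referenced the logic of Flink's overwrite operator.
   
   Only a new `overwrite` parameter (default `false`) needs to be added on the Hive side: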
   ```
   sink {
     Hive {
       table_name = "default.test_overwrite_1"
       metastore_uri = "thrift://hadoop-master1.orb.local:9083"
       overwrite = true
       source_table_name = "source_table"
     }
   }
   ```
   
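   ### Expected Logic
   
   | Operation Type                   | Logic Description |
   |----------------------------------|-------------------|
   | Hive table overwrite             | Delete the old table data, write the new data |
   | Hive partitioned table overwrite | Delete only the data of the partitions being overwritten, write the corresponding partition data |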
   
   
   
   ### Usage Scenario
   
   When performing Hive insert operations, the current mode is append, but in 
reality, there may be requirements for overwriting the data, similar to `insert 
overwrite`, or deleting before insertion. 
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

