liaorui opened a new pull request, #8096: URL: https://github.com/apache/inlong/pull/8096
…ssion
### Prepare a Pull Request

- Fixes #8092

### Motivation

1. The hive connector can currently only transmit data from a CDC source into a single hive table. This PR adds whole-database and multiple-table transmission, enabled with `sink.multiple.enable` and the other `sink.multiple.*` options (a configuration sketch follows the Modifications list below); see the InLong sort documentation for more detail.
2. The hive connector only supports hive 3.x. It can now be built for hive 2.x with `mvn clean install -pl org.apache.inlong:sort-connector-hive -am -DskipTests -Phive2`; the `hive2` maven profile in pom.xml imports the `hive-exec:2.2.0` dependency.

### Modifications

1. The official Flink `flink-connector-hive` connector can only write one hive table at a time, since it reads a batch of data and writes it to a single HDFS path. In the whole-database migration scenario, we want it to read from multiple source tables and write into multiple hive tables. The `org.apache.flink.core.fs.Path flinkPath` parameter of the `openNewInProgressFile` method in the `HadoopPathBasedBucketWriter` class represents the HDFS path, so we compose `flinkPath` with the sink database and table dynamically. To do this we rewrote `HadoopPathBasedPartFileWriter`, `HadoopPathBasedBulkFormatBuilder`, `DefaultHadoopFileCommitterFactory` and `PartitionCommitter` (a path-composition sketch follows this list).

2. The InLong hive connector can now create tables automatically by inferring the table schema from source CDC data. We cache a `HiveWriterFactory` object for each hive table (a cache sketch follows this list). `HiveWriterFactory` holds a `HiveShim` object that can be used to create or alter a hive table, along with the table's column information; `HiveTableUtil` is a utility class that creates or alters a hive table from that column information. The `sqlType` comes from the source CDC data in debezium format; the `cdc-base` module provides tools to convert `sqlType` into a Flink `RowField`, which we then map to hive dialect types.

   The hive connector can create hive tables with a time partition. The default partition name is `pt`; it can be changed with the `sink.partition.name` option. Three partition policies are supported: `PROC_TIME`, `ASSIGN_FIELD` and `NONE`. With `PROC_TIME`, the current system time in milliseconds is formatted with the `yyyy-MM-dd` pattern and written to the `pt` partition; the pattern can be changed with the `partition.time-extractor.timestamp-pattern` option. With `ASSIGN_FIELD`, the connector reads the field named by `source.partition.field.name` from the source CDC data and formats it with the `partition.time-extractor.timestamp-pattern` pattern. For example, if mysql has a `create_time` field holding `2023-05-25 18:00:00`, the connector can format it to `20230525` and write to HDFS under the `hdfs://xxxx/database/table/pt=20230525` path. With `NONE`, the connector creates the table without partitions (a formatting sketch for the two time-based policies follows this list).

   As with the iceberg connector, the hive connector only supports adding new fields inferred from the `sqlType` of the source CDC data. It cannot delete fields, because hive does not support that, and no other automatic DDL schema changes are supported.

3. Before this PR, the InLong hive connector only supported hive 3.x, since it depends on `hive-exec` 3.1.1. We can now package different jars with `mvn clean install -pl org.apache.inlong:sort-connector-hive -am -DskipTests -Phive2`, where `-Phive2` selects a maven profile that imports the `hive-exec` 2.2.0 dependency. To package a jar matching hive 3.x, simply omit `-Phive2` from the maven command.
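As a rough illustration of the multiple-sink configuration mentioned in Motivation item 1, here is a hedged Flink SQL sketch wrapped in Java. Only `sink.multiple.enable` and `sink.partition.name` are named in this PR; the `connector` value, the placeholder schema, and `sink.multiple.format` are assumptions modeled on the other `sink.multiple.*` connectors, so check the InLong sort documentation for the authoritative option set.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HiveMultipleSinkExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Hypothetical DDL: option names other than `sink.multiple.enable`
        // and `sink.partition.name` are assumptions, not confirmed by the PR.
        tableEnv.executeSql(
                "CREATE TABLE hive_sink (\n"
                        + "  raw BYTES\n"
                        + ") WITH (\n"
                        + "  'connector' = 'hive',\n"
                        + "  'sink.multiple.enable' = 'true',\n"
                        + "  'sink.multiple.format' = 'debezium-json',\n"
                        + "  'sink.partition.name' = 'pt'\n"
                        + ")");
    }
}
```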
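The path-composition sketch referenced in Modifications item 1: the idea is that the bucket writer no longer treats `flinkPath` as a fixed location but splices the sink database and table into it before opening each in-progress file. Class and method names below are hypothetical; only `org.apache.flink.core.fs.Path` and the composition idea come from the PR description.

```java
import org.apache.flink.core.fs.Path;

/**
 * Illustrative only: composes the static sink path with a per-record
 * database and table, mirroring what the rewritten bucket writer does
 * before opening a new in-progress file.
 */
public final class MultipleTablePathResolver {

    private final Path basePath; // the configured sink root, e.g. a warehouse directory

    public MultipleTablePathResolver(Path basePath) {
        this.basePath = basePath;
    }

    /** Builds {basePath}/{database}/{table} dynamically for each bucket. */
    public Path resolve(String database, String table) {
        return new Path(new Path(basePath, database), table);
    }
}
```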
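A minimal sketch of the per-table `HiveWriterFactory` cache from Modifications item 2. The real code caches Flink's `HiveWriterFactory`; the cache shape, the generic parameter, and the `database.table` key format are assumptions made for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-table cache; F stands in for HiveWriterFactory. */
public final class WriterFactoryCache<F> {

    private final ConcurrentMap<String, F> cache = new ConcurrentHashMap<>();
    private final Function<String, F> loader; // builds a factory for a "db.table" key

    public WriterFactoryCache(Function<String, F> loader) {
        this.loader = loader;
    }

    /** Returns the cached factory for the table, creating it on first use. */
    public F forTable(String database, String table) {
        return cache.computeIfAbsent(database + "." + table, loader);
    }
}
```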
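Finally, the formatting sketch for the two time-based partition policies: `PROC_TIME` formats the current system millis, while `ASSIGN_FIELD` re-formats a value read from the source record. The class, enum and method names are hypothetical, and the `yyyy-MM-dd HH:mm:ss` input format is assumed from the `create_time` example above.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public final class PartitionValueSketch {

    /** The three policies described in the PR. */
    public enum PartitionPolicy { PROC_TIME, ASSIGN_FIELD, NONE }

    /** PROC_TIME: format current millis, e.g. pattern "yyyy-MM-dd" -> "2023-05-25". */
    public static String procTimeValue(String pattern) {
        return LocalDateTime
                .ofInstant(Instant.ofEpochMilli(System.currentTimeMillis()), ZoneId.systemDefault())
                .format(DateTimeFormatter.ofPattern(pattern));
    }

    /** ASSIGN_FIELD: re-format a source field, e.g. "2023-05-25 18:00:00" + "yyyyMMdd" -> "20230525". */
    public static String assignFieldValue(String fieldValue, String pattern) {
        LocalDateTime ts = LocalDateTime.parse(
                fieldValue, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); // assumed input format
        return ts.format(DateTimeFormatter.ofPattern(pattern));
    }
}
```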
### Verifying this change

- [ ] This change is a trivial rework/code cleanup without any test coverage.
- [ ] This change is already covered by existing tests, such as: *(please describe tests)*
- [ ] This change added tests and can be verified as follows: *(please describe tests)*

### Documentation

- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
- If a feature is not applicable for documentation, explain why?
- If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
