liaorui opened a new pull request, #8096: URL: https://github.com/apache/inlong/pull/8096
…ssion
### Prepare a Pull Request

- Fixes #8092

### Motivation

1. The hive connector can currently only transmit data from a CDC source into a single hive table. This PR adds whole-database and multiple-table transmission, enabled with `sink.multiple.enable` and the other `sink.multiple.*` options (a configuration sketch follows the Modifications list below); see the InLong sort documentation for more detail.
2. The hive connector only supports hive 3.x. It can now be built for hive 2.x with `mvn clean install -pl org.apache.inlong:sort-connector-hive -am -DskipTests -Phive2`; the `hive2` maven profile in pom.xml imports the `hive-exec:2.2.0` dependency.

### Modifications

1. The official Flink `flink-connector-hive` connector can only write one hive table at a time, since it reads a batch of data and writes it to a single HDFS path. In the whole-database migration scenario, we want it to read from multiple source tables and write into multiple hive tables. The `org.apache.flink.core.fs.Path flinkPath` parameter of the `openNewInProgressFile` method in the `HadoopPathBasedBucketWriter` class represents the HDFS path, so we compose `flinkPath` with the sink database and table dynamically. To do this we rewrote `HadoopPathBasedPartFileWriter`, `HadoopPathBasedBulkFormatBuilder`, `DefaultHadoopFileCommitterFactory` and `PartitionCommitter` (a path-composition sketch follows this list).

2. The InLong hive connector can now create tables automatically by inferring the table schema from source CDC data. We cache a `HiveWriterFactory` object for each hive table (a cache sketch follows this list). `HiveWriterFactory` holds a `HiveShim` object that can be used to create or alter a hive table, along with the table's column information; `HiveTableUtil` is a utility class that creates or alters a hive table from that column information. The `sqlType` comes from the source CDC data in debezium format; the `cdc-base` module provides tools to convert `sqlType` into a Flink `RowField`, which we then map to hive dialect types.

   The hive connector can create hive tables with a time partition. The default partition name is `pt`; it can be changed with the `sink.partition.name` option. Three partition policies are supported: `PROC_TIME`, `ASSIGN_FIELD` and `NONE`. With `PROC_TIME`, the current system time in milliseconds is formatted with the `yyyy-MM-dd` pattern and written to the `pt` partition; the pattern can be changed with the `partition.time-extractor.timestamp-pattern` option. With `ASSIGN_FIELD`, the connector reads the field named by `source.partition.field.name` from the source CDC data and formats it with the `partition.time-extractor.timestamp-pattern` pattern. For example, if mysql has a `create_time` field holding `2023-05-25 18:00:00`, the connector can format it to `20230525` and write to HDFS under the `hdfs://xxxx/database/table/pt=20230525` path. With `NONE`, the connector creates the table without partitions (a formatting sketch for the two time-based policies follows this list).

   As with the iceberg connector, the hive connector only supports adding new fields inferred from the `sqlType` of the source CDC data. It cannot delete fields, because hive does not support that, and no other automatic DDL schema changes are supported.

3. Before this PR, the InLong hive connector only supported hive 3.x, since it depends on `hive-exec` 3.1.1. We can now package different jars with `mvn clean install -pl org.apache.inlong:sort-connector-hive -am -DskipTests -Phive2`, where `-Phive2` selects a maven profile that imports the `hive-exec` 2.2.0 dependency. To package a jar matching hive 3.x, simply omit `-Phive2` from the maven command.
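As a rough illustration of the multiple-sink configuration mentioned in Motivation item 1, here is a hedged Flink SQL sketch wrapped in Java. Only `sink.multiple.enable` and `sink.partition.name` are named in this PR; the `connector` value, the placeholder schema, and `sink.multiple.format` are assumptions modeled on the other `sink.multiple.*` connectors, so check the InLong sort documentation for the authoritative option set.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HiveMultipleSinkExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Hypothetical DDL: option names other than `sink.multiple.enable`
        // and `sink.partition.name` are assumptions, not confirmed by the PR.
        tableEnv.executeSql(
                "CREATE TABLE hive_sink (\n"
                        + "  raw BYTES\n"
                        + ") WITH (\n"
                        + "  'connector' = 'hive',\n"
                        + "  'sink.multiple.enable' = 'true',\n"
                        + "  'sink.multiple.format' = 'debezium-json',\n"
                        + "  'sink.partition.name' = 'pt'\n"
                        + ")");
    }
}
```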
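The path-composition sketch referenced in Modifications item 1: the idea is that the bucket writer no longer treats `flinkPath` as a fixed location but splices the sink database and table into it before opening each in-progress file. Class and method names below are hypothetical; only `org.apache.flink.core.fs.Path` and the composition idea come from the PR description.

```java
import org.apache.flink.core.fs.Path;

/**
 * Illustrative only: composes the static sink path with a per-record
 * database and table, mirroring what the rewritten bucket writer does
 * before opening a new in-progress file.
 */
public final class MultipleTablePathResolver {

    private final Path basePath; // the configured sink root, e.g. a warehouse directory

    public MultipleTablePathResolver(Path basePath) {
        this.basePath = basePath;
    }

    /** Builds {basePath}/{database}/{table} dynamically for each bucket. */
    public Path resolve(String database, String table) {
        return new Path(new Path(basePath, database), table);
    }
}
```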
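A minimal sketch of the per-table `HiveWriterFactory` cache from Modifications item 2. The real code caches Flink's `HiveWriterFactory`; the cache shape, the generic parameter, and the `database.table` key format are assumptions made for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-table cache; F stands in for HiveWriterFactory. */
public final class WriterFactoryCache<F> {

    private final ConcurrentMap<String, F> cache = new ConcurrentHashMap<>();
    private final Function<String, F> loader; // builds a factory for a "db.table" key

    public WriterFactoryCache(Function<String, F> loader) {
        this.loader = loader;
    }

    /** Returns the cached factory for the table, creating it on first use. */
    public F forTable(String database, String table) {
        return cache.computeIfAbsent(database + "." + table, loader);
    }
}
```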
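Finally, the formatting sketch for the two time-based partition policies: `PROC_TIME` formats the current system millis, while `ASSIGN_FIELD` re-formats a value read from the source record. The class, enum and method names are hypothetical, and the `yyyy-MM-dd HH:mm:ss` input format is assumed from the `create_time` example above.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public final class PartitionValueSketch {

    /** The three policies described in the PR. */
    public enum PartitionPolicy { PROC_TIME, ASSIGN_FIELD, NONE }

    /** PROC_TIME: format current millis, e.g. pattern "yyyy-MM-dd" -> "2023-05-25". */
    public static String procTimeValue(String pattern) {
        return LocalDateTime
                .ofInstant(Instant.ofEpochMilli(System.currentTimeMillis()), ZoneId.systemDefault())
                .format(DateTimeFormatter.ofPattern(pattern));
    }

    /** ASSIGN_FIELD: re-format a source field, e.g. "2023-05-25 18:00:00" + "yyyyMMdd" -> "20230525". */
    public static String assignFieldValue(String fieldValue, String pattern) {
        LocalDateTime ts = LocalDateTime.parse(
                fieldValue, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); // assumed input format
        return ts.format(DateTimeFormatter.ofPattern(pattern));
    }
}
```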
### Verifying this change

- [ ] This change is a trivial rework/code cleanup without any test coverage.
- [ ] This change is already covered by existing tests, such as: *(please describe tests)*
- [ ] This change added tests and can be verified as follows: *(please describe tests)*

### Documentation

- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
- If a feature is not applicable for documentation, explain why?
- If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
