JesseAtSZ opened a new issue #1461: URL: https://github.com/apache/incubator-seatunnel/issues/1461
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature requirement.

### Description

I have been following the SeaTunnel project for a long time. I am very interested in it and have downloaded and tried it. I hope this project can support CDC, so I have tried to implement Flink CDC for it, but I am not familiar with Flink or Java. After a period of study I understand Flink and CDC better and have some ideas, but since I am still new to Flink, I would like to discuss them with you here and hope you can give me comments.

SeaTunnel currently supports two plug-in systems: Flink and Spark. However, the existing plug-in systems' support for CDC still has room for improvement. I suggest adding a dedicated CDC system for the following reasons:

1. In the existing Flink system, the source outputs a `DataStream<Row>` to make the transform stage easy to process, but CDC does not need a transform stage: it only needs source and sink.
2. CDC may need to synchronize schema changes, and a `Row` carries only one row of data and its corresponding operation. A `DataStream<Row>` cannot carry schema information, so schema information needs to be attached in the source stage.

So I think the new CDC system should consist of two phases: source and sink. The source outputs a `DataStream<String>`; the sink receives the `DataStream<String>`, parses the fields, and modifies the target database through Flink JDBC.

Taking MySQL-to-MySQL synchronization as an example, the key points of data processing are as follows:

1. A custom deserializer, `SeatunnelDeserializationSchema`, parses the `SourceRecord` (a Debezium change log event) into a JSON string.
2. The `op` field identifies the operation on the row data, the `value` field identifies the change to the data fields, and the `schema` field identifies the change to the table structure.
3. The `INSERT`, `DELETE`, and `UPDATE` SQL statements are spliced together according to the `schema` field, and data synchronization is carried out through Flink JDBC.

Remaining problems:

1. Is it necessary to convert the `SourceRecord` to a `DataStream<String>` (JSON)? Is any other stream type more suitable for the `SourceRecord` conversion?
2. Can JDBC on the sink side only insert, delete, and update data by splicing SQL?
3. Is the licensing acceptable? Flink CDC Connectors relies on the MySQL JDBC driver, which is GPL-licensed, while Flink CDC Connectors itself is licensed under Apache 2.0.

The above is my simple design for SeaTunnel CDC support. I am really not familiar with Flink (I have only studied it for about a week), so I hope you can give me more suggestions. Thank you very much!

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes, I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
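To make key point 1 concrete, here is a minimal sketch of the JSON envelope such a deserializer could emit for one change event. This is purely illustrative, not the actual SeaTunnel or Flink CDC API: the field names `op` (`c`/`u`/`d`), the row image under `value`, and the `table` field follow the Debezium naming convention described above, and `ChangeEventJson` is a hypothetical helper name. For simplicity it assumes string-typed column values and does no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChangeEventJson {

    /**
     * Render one change event as a flat JSON string.
     * Assumes string-typed values and keys/values that need no escaping.
     */
    public static String toJson(String op, String table, Map<String, String> value) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"op\":\"").append(op).append("\",");
        sb.append("\"table\":\"").append(table).append("\",");
        sb.append("\"value\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : value.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("name", "alice");
        // An insert ("c") event for table "users":
        System.out.println(toJson("c", "users", row));
    }
}
```

In a real source this string would be produced inside the deserializer's `deserialize` callback from the Debezium `SourceRecord`, and a library such as Jackson would be a safer choice than hand-built strings; the sketch only shows the shape of the payload that the sink would later parse.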
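Key point 3 (the sink-side SQL "splicing") could look roughly like the sketch below. `SqlSplicer` is a hypothetical helper, not an existing SeaTunnel class, and it assumes the change event has already been parsed into an op code, a table name, a key column, and a column-to-value map. Note that string concatenation like this is vulnerable to SQL injection; a real sink should bind values through parameterized statements (e.g. Flink's JDBC sink with `?` placeholders), which also relates to remaining problem 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SqlSplicer {

    /** Map a Debezium-style op code (c/u/d) plus row values to one SQL statement. */
    public static String splice(String op, String table, String keyCol,
                                Map<String, String> row) {
        switch (op) {
            case "c": { // insert: all columns, all values
                String cols = String.join(", ", row.keySet());
                String vals = row.values().stream()
                        .map(v -> "'" + v + "'")
                        .collect(Collectors.joining(", "));
                return "INSERT INTO " + table + " (" + cols + ") VALUES (" + vals + ")";
            }
            case "u": { // update every non-key column, match on the key column
                String sets = row.entrySet().stream()
                        .filter(e -> !e.getKey().equals(keyCol))
                        .map(e -> e.getKey() + " = '" + e.getValue() + "'")
                        .collect(Collectors.joining(", "));
                return "UPDATE " + table + " SET " + sets
                        + " WHERE " + keyCol + " = '" + row.get(keyCol) + "'";
            }
            case "d": // delete by key
                return "DELETE FROM " + table
                        + " WHERE " + keyCol + " = '" + row.get(keyCol) + "'";
            default:
                throw new IllegalArgumentException("unknown op: " + op);
        }
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("name", "alice");
        System.out.println(splice("c", "users", "id", row));
        System.out.println(splice("u", "users", "id", row));
        System.out.println(splice("d", "users", "id", row));
    }
}
```

Handling the `schema` field (DDL changes such as added columns) would need a separate path that issues `ALTER TABLE` statements before the data statements; that part is not sketched here.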
