JesseAtSZ opened a new issue #1461:
URL: https://github.com/apache/incubator-seatunnel/issues/1461


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   I have been following the SeaTunnel project for a long time. I am very 
interested in it and have downloaded and tried it out. I hope the project 
can support CDC, so I tried to implement Flink CDC for it, although I was not 
familiar with Flink or Java. After a period of study I now understand Flink 
and CDC better and have some ideas, but I am still not very familiar with 
Flink, so I would like to discuss them here and hope you can give me 
comments:
   
   SeaTunnel currently supports two plug-in systems: Flink and Spark. However, 
there is still room for improvement in the existing plug-in systems' support 
for CDC. I suggest adding a dedicated CDC system for the following reasons: 
   1. In the existing Flink system, the source outputs a DataStream<Row> to 
facilitate processing in the transform stage, but CDC does not need a 
transform stage, only source and sink.
   2. CDC may need to synchronize schema changes, while a Row carries only a 
single row of data and its corresponding operation. A DataStream<Row> cannot 
carry schema information, so schema information has to be attached in the 
source stage.
   
   So I think the new CDC system should consist of two phases: source and 
sink. The source outputs a DataStream<String>; the sink receives that 
DataStream<String>, parses the fields, and modifies the target database 
through Flink JDBC.
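   For illustration, an element of the proposed DataStream<String> could be 
shaped like Debezium's change-event envelope (op / before / after fields). The 
helper class below is a hypothetical sketch, not existing SeaTunnel code; it 
builds such a JSON string with the standard library only:

   ```java
   // Hypothetical sketch of the string envelope a CDC source could emit into
   // DataStream<String>. The field names (op, before, after) follow the
   // Debezium change-event layout; the class name is an assumption.
   public class ChangeEventEnvelope {

       /**
        * Builds a minimal JSON change event. op is "c" (create), "u" (update)
        * or "d" (delete); before/after are pre-serialized JSON objects or null.
        */
       public static String toJson(String op, String before, String after) {
           return String.format(
               "{\"op\":\"%s\",\"before\":%s,\"after\":%s}",
               op,
               before == null ? "null" : before,
               after == null ? "null" : after);
       }
   }
   ```

   A delete event, for example, would serialize as 
`{"op":"d","before":{"id":1},"after":null}`; a real implementation would use a 
JSON library such as Jackson instead of string formatting, and would also 
carry a schema field.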
   
   Taking MySQL-to-MySQL synchronization as an example, the key points of data 
processing are as follows:
   1. A custom deserializer, SeatunnelDeserializationSchema, parses the 
SourceRecord (a Debezium change event) into a JSON string.
   2. The op field identifies the operation performed on the row, the value 
field carries the changed data fields, and the schema field describes changes 
to the table structure.
   3. INSERT, DELETE, and UPDATE statements are spliced together according to 
the schema field, and data synchronization is performed through Flink JDBC.
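   The splicing in step 3 could look roughly like the sketch below. 
CdcSqlBuilder and its map-based row representation are assumptions for 
illustration only; a production sink should bind values through 
PreparedStatement placeholders rather than splicing literals, to avoid SQL 
injection and quoting problems:

   ```java
   import java.util.Map;
   import java.util.stream.Collectors;

   // Hypothetical sketch: turning a parsed change event into one DML
   // statement for the JDBC sink. keys holds the primary-key columns,
   // row holds all column values of the changed row.
   public class CdcSqlBuilder {

       /** Builds one DML statement from the op field and the row's columns. */
       public static String toSql(String table, String op,
                                  Map<String, String> keys,
                                  Map<String, String> row) {
           switch (op) {
               case "c": // create -> INSERT
                   return "INSERT INTO " + table + " ("
                       + String.join(", ", row.keySet()) + ") VALUES ("
                       + row.values().stream().map(v -> "'" + v + "'")
                            .collect(Collectors.joining(", ")) + ")";
               case "u": // update -> UPDATE ... WHERE <keys>
                   return "UPDATE " + table + " SET "
                       + join(row, ", ") + " WHERE " + join(keys, " AND ");
               case "d": // delete -> DELETE ... WHERE <keys>
                   return "DELETE FROM " + table + " WHERE " + join(keys, " AND ");
               default:
                   throw new IllegalArgumentException("unknown op: " + op);
           }
       }

       private static String join(Map<String, String> cols, String sep) {
           return cols.entrySet().stream()
               .map(e -> e.getKey() + " = '" + e.getValue() + "'")
               .collect(Collectors.joining(sep));
       }
   }
   ```

   For keys `{id=1}` and row `{id=1, name=b}`, an update event would yield 
`UPDATE t SET id = '1', name = 'b' WHERE id = '1'`.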
   
   Remaining problems:
   1. Is it necessary to convert the SourceRecord to a DataStream<String> 
(JSON)? Is there another stream type better suited to the SourceRecord 
conversion?
   2. Can the JDBC sink only insert, delete, and update data by splicing SQL 
statements?
   3. Is the license compatible? Flink CDC Connectors depends on the MySQL 
JDBC driver, which is GPL-licensed, while Flink CDC Connectors itself is 
Apache 2.0 licensed.
   
   The above is my rough design for SeaTunnel CDC support. I am really not 
familiar with Flink (I have only studied it for about a week), so I hope you 
can give me more suggestions. Thank you very much!
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

