JesseAtSZ opened a new issue #1461:
URL: https://github.com/apache/incubator-seatunnel/issues/1461


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   I have been following the SeaTunnel project for a long time. I am very 
interested in it and have downloaded and tried it out. I hope the project 
can support CDC, so I tried to implement Flink CDC for it, although I was not 
familiar with Flink or Java. After a period of study I now understand Flink 
and CDC better and have some ideas, but I am still not very familiar with 
Flink, so I would like to discuss them here and hope you can give me 
comments:
   
   SeaTunnel currently supports two plug-in systems: Flink and Spark. However, 
there is still room for improvement in the existing plug-in systems' support 
for CDC. I suggest adding a dedicated CDC system for the following reasons: 
   1. In the existing Flink system, the source outputs a DataStream<Row> to 
facilitate processing in the transform stage, but CDC does not need a 
transform stage, only source and sink.
   2. CDC may need to synchronize schema changes, while a Row carries only a 
single row of data and its corresponding operation. A DataStream<Row> cannot 
carry schema information, so schema information has to be attached in the 
source stage.
   
   So I think the new CDC system should consist of two phases: source and 
sink. The source outputs a DataStream<String>; the sink receives that 
DataStream<String>, parses the fields, and modifies the target database 
through Flink JDBC.
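   For illustration, an element of the proposed DataStream<String> could be 
shaped like Debezium's change-event envelope (op / before / after fields). The 
helper class below is a hypothetical sketch, not existing SeaTunnel code; it 
builds such a JSON string with the standard library only:

   ```java
   // Hypothetical sketch of the string envelope a CDC source could emit into
   // DataStream<String>. The field names (op, before, after) follow the
   // Debezium change-event layout; the class name is an assumption.
   public class ChangeEventEnvelope {

       /**
        * Builds a minimal JSON change event. op is "c" (create), "u" (update)
        * or "d" (delete); before/after are pre-serialized JSON objects or null.
        */
       public static String toJson(String op, String before, String after) {
           return String.format(
               "{\"op\":\"%s\",\"before\":%s,\"after\":%s}",
               op,
               before == null ? "null" : before,
               after == null ? "null" : after);
       }
   }
   ```

   A delete event, for example, would serialize as 
`{"op":"d","before":{"id":1},"after":null}`; a real implementation would use a 
JSON library such as Jackson instead of string formatting, and would also 
carry a schema field.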
   
   Taking MySQL-to-MySQL synchronization as an example, the key points of data 
processing are as follows:
   1. A custom deserializer, SeatunnelDeserializationSchema, parses the 
SourceRecord (a Debezium change event) into a JSON string.
   2. The op field identifies the operation performed on the row, the value 
field carries the changed data fields, and the schema field describes changes 
to the table structure.
   3. INSERT, DELETE, and UPDATE statements are spliced together according to 
the schema field, and data synchronization is performed through Flink JDBC.
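   The splicing in step 3 could look roughly like the sketch below. 
CdcSqlBuilder and its map-based row representation are assumptions for 
illustration only; a production sink should bind values through 
PreparedStatement placeholders rather than splicing literals, to avoid SQL 
injection and quoting problems:

   ```java
   import java.util.Map;
   import java.util.stream.Collectors;

   // Hypothetical sketch: turning a parsed change event into one DML
   // statement for the JDBC sink. keys holds the primary-key columns,
   // row holds all column values of the changed row.
   public class CdcSqlBuilder {

       /** Builds one DML statement from the op field and the row's columns. */
       public static String toSql(String table, String op,
                                  Map<String, String> keys,
                                  Map<String, String> row) {
           switch (op) {
               case "c": // create -> INSERT
                   return "INSERT INTO " + table + " ("
                       + String.join(", ", row.keySet()) + ") VALUES ("
                       + row.values().stream().map(v -> "'" + v + "'")
                            .collect(Collectors.joining(", ")) + ")";
               case "u": // update -> UPDATE ... WHERE <keys>
                   return "UPDATE " + table + " SET "
                       + join(row, ", ") + " WHERE " + join(keys, " AND ");
               case "d": // delete -> DELETE ... WHERE <keys>
                   return "DELETE FROM " + table + " WHERE " + join(keys, " AND ");
               default:
                   throw new IllegalArgumentException("unknown op: " + op);
           }
       }

       private static String join(Map<String, String> cols, String sep) {
           return cols.entrySet().stream()
               .map(e -> e.getKey() + " = '" + e.getValue() + "'")
               .collect(Collectors.joining(sep));
       }
   }
   ```

   For keys `{id=1}` and row `{id=1, name=b}`, an update event would yield 
`UPDATE t SET id = '1', name = 'b' WHERE id = '1'`.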
   
   Remaining problems:
   1. Is it necessary to convert the SourceRecord to a DataStream<String> 
(JSON)? Is there another stream type better suited to the SourceRecord 
conversion?
   2. Can the JDBC sink only insert, delete, and update data by splicing SQL 
statements?
   3. Is the license compatible? Flink CDC Connectors depends on the MySQL 
JDBC driver, which is GPL-licensed, while Flink CDC Connectors itself is 
Apache 2.0 licensed.
   
   The above is my rough design for SeaTunnel CDC support. I am really not 
familiar with Flink (I have only studied it for about a week), so I hope you 
can give me more suggestions. Thank you very much!
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

