loserwang1024 commented on PR #4233: URL: https://github.com/apache/flink-cdc/pull/4233#issuecomment-4053361936
I have reviewd this PR(by @linjianchang )and https://github.com/apache/flink-cdc/pull/4259 by @zml1206 . Thanks for your hard work! I have also done some research on the Debezium codebase and PostgreSQL replication, and I have another design idea that prefers to leverage PostgreSQL’s native capabilities more fully. So I’d like to discuss it with you. ------------------------------------------------------ ### PostgreSQL Pgoutput's relation message in logical replication As document of PostgreSQL 16 says: [Section 55.5.3 - Logical Replication Protocol Message Flow](https://www.postgresql.org/docs/16/protocol-logical-replication.html): > "Every DML message contains a relation OID, identifying the publisher's relation that was acted on. Before the first DML message for a given relation OID, a Relation message will be sent, describing the schema of that relation. Subsequently, a new Relation message will be sent if the relation's definition has changed since the last Relation message was sent for it." When DDL changes are executed in PostgreSQL, no corresponding logs are generated. However, if the pgoutput sender is about to send the first DML message for a new schema, it will first send a Relation message. ### how Debezium use it? Depending on The decoder plug-in, schema updates take two completely different paths: ● pgoutput: Sends a correlation message before each DML event,You can use the applySchemaChangesForTable to actively update schema in advance. shouldSchemaBeSynchronized() returns false, so synchronizeTableSchema() is an empty operation for DML events. ● decoderbufs: If no RELATION message is sent, shouldSchemaBeSynchronized() returns true (the default value). The schema is synchronized by comparing the message column with the in-memory schema in a reactive manner. Thus, if it is pgoutput, we can send schema change events on demand without comparing each message? ### What cdc need to do? Therefore, my personal opinion: 1. Extend the PostgresSchema and pass in a dispatcher. When a correlation message is received,Put the schema in the event queue as a special message 2. When Postgres RecordEmitter receives the schema event, it: 3. Update schemas in split,For persistence (in the current master branch, since pg cdc does not have a schema change event, the information in schemas will never be updated) 4. Compare table changes and send scheme ddl in yaml. <img width="2442" height="1254" alt="image" src="https://github.com/user-attachments/assets/89ed012e-867f-46d1-be43-2ce54d464220" /> In this way, we can avoid intrusive changes to PostgreSQL through triggers, and there is also no need to compare schemas for every message. More importantly, it updates the schemas stored in the split, so schema consistency can still be guaranteed after state recovery or restart. @linjianchang @zml1206 @leonardBang , WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
