loserwang1024 commented on PR #4233:
URL: https://github.com/apache/flink-cdc/pull/4233#issuecomment-4053361936

   I have reviewd this PR(by @linjianchang )and 
https://github.com/apache/flink-cdc/pull/4259 by @zml1206 . Thanks for your 
hard work!
   
   I have also done some research on the Debezium codebase and PostgreSQL 
replication, and I have another design idea that prefers to leverage 
PostgreSQL’s native capabilities more fully. So I’d like to discuss it with you.
   
   ------------------------------------------------------
   
   ### PostgreSQL Pgoutput's relation message in logical replication
   As document of  PostgreSQL 16 says:  [Section 55.5.3 - Logical Replication 
Protocol Message 
Flow](https://www.postgresql.org/docs/16/protocol-logical-replication.html):
   
   >  "Every DML message contains a relation OID, identifying the publisher's 
relation that was acted on. Before the first DML message for a given relation 
OID, a Relation message will be sent, describing the schema of that relation. 
Subsequently, a new Relation message will be sent if the relation's definition 
has changed since the last Relation message was sent for it."
   
   
   When DDL changes are executed in PostgreSQL, no corresponding logs are 
generated. However, if the pgoutput sender is about to send the first DML 
message for a new schema, it will first send a Relation message.
   
   ### how Debezium use it?
   Depending on The decoder plug-in, schema updates take two completely 
different paths:
   ● pgoutput: Sends a correlation message before each DML event,You can use 
the applySchemaChangesForTable to actively update schema in advance. 
shouldSchemaBeSynchronized() returns false, so synchronizeTableSchema() is an 
empty operation for DML events.
   ● decoderbufs: If no RELATION message is sent, shouldSchemaBeSynchronized() 
returns true (the default value). The schema is synchronized by comparing the 
message column with the in-memory schema in a reactive manner.
   Thus, if it is pgoutput, we can send schema change events on demand without 
comparing each message?
   
   ### What cdc need to do?
   Therefore, my personal opinion:
   1. Extend the PostgresSchema and pass in a dispatcher. When a correlation 
message is received,Put the schema in the event queue as a special message
   2. When Postgres RecordEmitter receives the schema event, it:
   3. Update schemas in split,For persistence (in the current master branch, 
since pg cdc does not have a schema change event, the information in schemas 
will never be updated)
   4. Compare table changes and send scheme ddl in yaml.
   <img width="2442" height="1254" alt="image" 
src="https://github.com/user-attachments/assets/89ed012e-867f-46d1-be43-2ce54d464220";
 />
   
   
   In this way, we can avoid intrusive changes to PostgreSQL through triggers, 
and there is also no need to compare schemas for every message. More 
importantly, it updates the schemas stored in the split, so schema consistency 
can still be guaranteed after state recovery or restart.
   
   
   @linjianchang @zml1206 @leonardBang , WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to