JNSimba opened a new pull request, #64850:
URL: https://github.com/apache/doris/pull/64850

   ## Proposed changes
   
   ### Problem
   
   For the PostgreSQL streaming job (from-to / at-least-once path), schema-change
   (ADD/DROP column) detection was done **per DML record**: every change row's
   Kafka Connect "after" schema was diffed against the stored schema, and on any
   divergence a JDBC round-trip fetched the fresh PG schema to obtain accurate
   column types.
   
   This has two drawbacks:
   - a name diff runs on the hot path for every DML record;
   - accurate column types require an out-of-band JDBC fetch on every detected change.
   
   ### What this PR does
   
   Switch PG schema-change detection to be **event-driven**, sourced from pgoutput
   Relation messages (surfaced as schema-change records on the stream). On each
   Relation event, the full post-change table schema it carries is diffed against
   the stored baseline (the Doris table's current schema, loaded from FE) to derive
   ADD/DROP column DDL. The accurate column type, nullability, and default come
   from the Relation-carried schema, so the per-record diff and the JDBC fetch are
   both removed.
   
   Behavior preserved:
   - **Baseline** is established by the table's first Relation event (covers streams
     that start directly from an offset without a snapshot); no DDL is emitted for it.
   - **Rename guard**: a simultaneous ADD+DROP is treated as a possible RENAME and no
     DDL is emitted, to avoid data loss; the column must be renamed manually in Doris.
   - **Excluded columns** are skipped for both ADD and DROP.
   - **NOT NULL without a usable default** is added as NULLABLE (incoming DML still
     carries the real values).
   - The DDL is applied only on the from-to write path (unchanged; TVF mode does not
     apply schema-change records).
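
The baseline, rename-guard, and excluded-column rules above can be illustrated with a minimal sketch. It is not the PR's code: the class `RelationDiffSketch`, the `onRelation` method, the name-to-type map representation, and the DDL strings are all hypothetical, and leaving the baseline untouched when the rename guard fires is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and DDL strings are illustrative, not the PR's code.
class RelationDiffSketch {
    // table -> (column name -> type); a stand-in for the stored baseline.
    private final Map<String, Map<String, String>> baselines = new LinkedHashMap<>();
    private final Set<String> excludedColumns;

    RelationDiffSketch(Set<String> excludedColumns) {
        this.excludedColumns = excludedColumns;
    }

    /** Returns the DDL derived from one Relation event; empty when nothing should be applied. */
    List<String> onRelation(String table, Map<String, String> incoming) {
        List<String> ddl = new ArrayList<>();
        Map<String, String> baseline = baselines.get(table);
        if (baseline == null) {
            // First Relation event for this table: record the baseline, emit no DDL.
            baselines.put(table, new LinkedHashMap<>(incoming));
            return ddl;
        }
        List<String> added = new ArrayList<>();
        List<String> dropped = new ArrayList<>();
        for (String col : incoming.keySet()) {
            if (!baseline.containsKey(col) && !excludedColumns.contains(col)) {
                added.add(col);
            }
        }
        for (String col : baseline.keySet()) {
            if (!incoming.containsKey(col) && !excludedColumns.contains(col)) {
                dropped.add(col);
            }
        }
        if (!added.isEmpty() && !dropped.isEmpty()) {
            // Rename guard: ADD and DROP in the same event may be a RENAME, so emit nothing.
            // Assumption: the baseline is left unchanged in this case.
            return ddl;
        }
        for (String col : added) {
            ddl.add("ALTER TABLE " + table + " ADD COLUMN " + col + " " + incoming.get(col));
            baseline.put(col, incoming.get(col));
        }
        for (String col : dropped) {
            ddl.add("ALTER TABLE " + table + " DROP COLUMN " + col);
            baseline.remove(col);
        }
        return ddl;
    }
}
```

Because the diff runs only when a Relation event arrives, a repeated Relation with an unchanged schema produces an empty result, which matches the no-op idempotency case listed under Tests.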
   
   Default-value handling (`stripPgDefault`) is best-effort: string / numeric /
   boolean literals and `now()/current_timestamp/localtimestamp` are mapped; any
   other expression degrades to no static default rather than emitting a wrong
   DEFAULT clause.
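
As an illustration of the best-effort behaviour described above, here is a minimal sketch of a default-expression parser. It is not the PR's `stripPgDefault`: the class `PgDefaultSketch`, the method `stripDefault`, the accepted cast shapes, and the choice of `CURRENT_TIMESTAMP` as the mapped value for the time functions are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of best-effort default parsing; not the PR's stripPgDefault.
class PgDefaultSketch {
    // A trailing cast such as ::text, ::character varying(20), ::numeric(10,2), ::int[].
    private static final Pattern CAST = Pattern.compile(
            "::\\s*[A-Za-z_][A-Za-z0-9_ ]*(\\(\\d+(\\s*,\\s*\\d+)?\\))?(\\[\\])?");
    private static final Pattern NUMBER = Pattern.compile("[-+]?\\d+(\\.\\d+)?");

    /** Returns the static default value, or empty when the expression is not recognised. */
    static Optional<String> stripDefault(String expr) {
        if (expr == null) {
            return Optional.empty();
        }
        String s = stripParens(expr.trim());
        if (s.isEmpty()) {
            return Optional.empty();
        }
        if (s.charAt(0) == '\'') {
            return parseQuoted(s);
        }
        int cast = s.indexOf("::");
        if (cast >= 0) {
            if (!CAST.matcher(s.substring(cast)).matches()) {
                return Optional.empty();
            }
            s = stripParens(s.substring(0, cast).trim());
        }
        if (NUMBER.matcher(s).matches()) {
            return Optional.of(s);
        }
        String lower = s.toLowerCase(Locale.ROOT);
        if (lower.equals("true") || lower.equals("false")) {
            return Optional.of(lower);
        }
        if (lower.equals("now()") || lower.equals("current_timestamp")
                || lower.equals("localtimestamp")) {
            return Optional.of("CURRENT_TIMESTAMP"); // assumed target keyword
        }
        return Optional.empty(); // anything else degrades to "no static default"
    }

    // Naive outer-parenthesis removal; the strict checks afterwards reject bad results.
    private static String stripParens(String s) {
        while (s.length() >= 2 && s.charAt(0) == '(' && s.charAt(s.length() - 1) == ')') {
            s = s.substring(1, s.length() - 1).trim();
        }
        return s;
    }

    // Reads a single-quoted literal ('' is an escaped quote), allowing only a trailing cast.
    private static Optional<String> parseQuoted(String s) {
        StringBuilder out = new StringBuilder();
        int i = 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\'') {
                if (i + 1 < s.length() && s.charAt(i + 1) == '\'') {
                    out.append('\'');
                    i += 2;
                    continue;
                }
                String rest = s.substring(i + 1).trim();
                boolean ok = rest.isEmpty() || CAST.matcher(rest).matches();
                return ok ? Optional.of(out.toString()) : Optional.empty();
            }
            out.append(c);
            i++;
        }
        return Optional.empty(); // unterminated literal
    }
}
```

The literal is scanned character by character rather than split on `::`, so a `::` inside the quotes is kept as data. An expression such as `nextval('seq'::regclass)` falls through every check and yields no default.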
   
   Changes:
   - `PostgresDebeziumJsonDeserializer`: event-driven `handleSchemaChangeEvent`,
     replacing the per-DML diff and the JDBC schema refresher.
   - `JdbcIncrementalSourceReader`: pass schema-change records through to the
     deserializer without advancing the offset.
   - `PostgresSourceReader`: enable schema-change records on the source; drop the
     now-unused JDBC schema-refresher injection.
   - `SchemaChangeHelper`: remove the now-unused name-only diff helper.
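
The pass-through behaviour described for `JdbcIncrementalSourceReader` can be pictured roughly as follows. This is a hypothetical sketch, not the reader's actual code: the `Rec` shape, the `emit` method, and the `lastOffset` field are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: record shape and field names are illustrative only.
class PassThroughSketch {
    static final class Rec {
        final boolean schemaChange;
        final long position;
        final String payload;

        Rec(boolean schemaChange, long position, String payload) {
            this.schemaChange = schemaChange;
            this.position = position;
            this.payload = payload;
        }
    }

    long lastOffset = -1L;
    final List<String> deserialized = new ArrayList<>();

    void emit(Rec rec) {
        // Every record, schema-change or DML, reaches the deserializer.
        deserialized.add(rec.payload);
        if (!rec.schemaChange) {
            // Only data records move the resume position forward.
            lastOffset = rec.position;
        }
    }
}
```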
   
   ### Tests
   
   - Unit tests cover: baseline establishment, no-op idempotency, ADD, DROP, the
     ADD+DROP rename guard, excluded-column ADD/DROP skipping, and default-value
     parsing (parenthesised/`::`-containing string literals, unrecognized-keyword
     downgrade).
   - End-to-end ADD/DROP/DEFAULT/NOT NULL regression suites for the PG streaming job.

