JNSimba opened a new pull request, #64075:
URL: https://github.com/apache/doris/pull/64075

   ## What problem does this PR solve?
   
   Three related fixes for PostgreSQL CDC streaming jobs (`FROM ... TO ...` mode).
   
   ### 1. Multi-table snapshot data loss
   
   For a job with **multiple tables** and `offset=initial`, each snapshot/backfill split rewrites the shared per-job publication to contain only its own table (`ALTER PUBLICATION ... SET TABLE <one table>`), so during the snapshot the publication keeps flipping between single tables. Logical decoding evaluates publication membership at the WAL position of each change, so a row written to a table while the publication temporarily excludes it is filtered out and **permanently lost**, violating the at-least-once guarantee. Single-table jobs and user-provided publications are not affected.
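
The failure mode described above can be modeled with a toy simulation. This is a sketch, not Doris or PostgreSQL code: `PublicationFlipDemo`, `Change`, and the table names `t1`/`t2` are all hypothetical, and the "decoder" is just a membership check that mimics how a change is filtered by the publication contents in effect at its WAL position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names exist in Doris or PostgreSQL.
final class PublicationFlipDemo {

    /** One row change plus the publication membership in effect at its WAL position. */
    static final class Change {
        final String table;
        final int rowId;
        final Set<String> publishedTables;

        Change(String table, int rowId, String... publishedTables) {
            this.table = table;
            this.rowId = rowId;
            this.publishedTables = new HashSet<>(Arrays.asList(publishedTables));
        }
    }

    /** Emits a change only if its table was in the publication at that position. */
    public static List<String> decode(List<Change> wal) {
        List<String> emitted = new ArrayList<>();
        for (Change c : wal) {
            if (c.publishedTables.contains(c.table)) {
                emitted.add(c.table + "#" + c.rowId);
            }
        }
        return emitted;
    }

    /** Per-split rewrite: the publication holds only the table of the running split. */
    public static List<Change> flippingWal() {
        return Arrays.asList(
            new Change("t1", 1, "t1"),  // split for t1 is running
            new Change("t2", 1, "t1"),  // t2 written while excluded: filtered out
            new Change("t1", 2, "t2"),  // split for t2 is running, t1 excluded: filtered out
            new Change("t2", 2, "t2"));
    }

    /** Publication created once with every table: nothing is filtered. */
    public static List<Change> stableWal() {
        return Arrays.asList(
            new Change("t1", 1, "t1", "t2"),
            new Change("t2", 1, "t1", "t2"),
            new Change("t1", 2, "t1", "t2"),
            new Change("t2", 2, "t1", "t2"));
    }

    public static void main(String[] args) {
        System.out.println("flipping publication: " + decode(flippingWal()));
        System.out.println("stable publication:   " + decode(stableWal()));
    }
}
```

In the flipping case only two of the four changes survive decoding, and the dropped ones cannot be recovered by re-reading the slot later, because the filter depends on the membership at the time of the write.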
   
   Fix: Doris now creates and owns the publication up front in `PostgresSourceReader.initialize`, covering the full `include_tables` set, and publication autocreate is always DISABLED so the publication is never rewritten per split.
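
As a rough illustration of creating the publication once for the whole table set, the statement could be assembled as below. This is a sketch under assumptions: `PublicationSqlSketch`, `createPublicationSql`, and the example names are hypothetical and are not taken from the PR; only the `CREATE PUBLICATION ... FOR TABLE ...` syntax is standard PostgreSQL. The Debezium PostgreSQL connector documents a `publication.autocreate.mode` option with a `disabled` value, but whether the PR sets exactly that property is an assumption here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the code in PostgresSourceReader.
final class PublicationSqlSketch {

    /** Quotes a PostgreSQL identifier, doubling any embedded double quotes. */
    public static String quoteIdent(String ident) {
        return "\"" + ident.replace("\"", "\"\"") + "\"";
    }

    /** Builds one CREATE PUBLICATION statement that lists every included table. */
    public static String createPublicationSql(String publication, String schema, List<String> tables) {
        if (tables.isEmpty()) {
            throw new IllegalArgumentException("at least one table is required");
        }
        String tableList = tables.stream()
            .map(t -> quoteIdent(schema) + "." + quoteIdent(t))
            .collect(Collectors.joining(", "));
        return "CREATE PUBLICATION " + quoteIdent(publication) + " FOR TABLE " + tableList;
    }

    public static void main(String[] args) {
        System.out.println(createPublicationSql(
            "example_pub", "public", java.util.Arrays.asList("orders", "users")));
    }
}
```

Because every table is listed at creation time, no later split needs to issue `ALTER PUBLICATION ... SET TABLE`, which is the operation that caused the flipping.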
   
   ### 2. Binlog "Duplicate key" when a same-named table exists in another schema
   
   `doReadTableColumn` filtered the columns returned by `getColumns` only by TABLE_NAME. Because the schema argument of `getColumns` is a LIKE pattern, a second schema whose name matches that pattern through a wildcard character (e.g. `_`) and that contains a same-named table leaks its columns into the result. Those columns then collide on column name and throw `IllegalStateException: Duplicate key` in the binlog reader.
   
   Fix: also compare SCHEMA_NAME.
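
The collision can be reproduced without a database. The sketch below is an assumption-laden illustration: `DuplicateKeyDemo`, `ColumnRow`, and the schema names `a_b`/`axb` are hypothetical, and the column map is built with `Collectors.toMap`, which is one common way to hit this exception, not necessarily how the reader builds its map. The rows stand in for what a `getColumns` call with schema pattern `a_b` could return, since `_` in a LIKE pattern matches any single character.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for JDBC column metadata rows; not Doris code.
final class DuplicateKeyDemo {

    static final class ColumnRow {
        final String schema;
        final String table;
        final String column;

        ColumnRow(String schema, String table, String column) {
            this.schema = schema;
            this.table = table;
            this.column = column;
        }
    }

    /** Pattern "a_b" matches schema "a_b" itself and also schema "axb". */
    public static List<ColumnRow> metadataRows() {
        return Arrays.asList(
            new ColumnRow("a_b", "orders", "id"),
            new ColumnRow("a_b", "orders", "amount"),
            new ColumnRow("axb", "orders", "id"));
    }

    /** Old behavior: filter by table name only, so both "id" columns survive. */
    public static Map<String, ColumnRow> byTableOnly(List<ColumnRow> rows, String table) {
        return rows.stream()
            .filter(r -> r.table.equals(table))
            .collect(Collectors.toMap(r -> r.column, Function.identity()));
    }

    /** Fixed behavior: the schema must match exactly as well. */
    public static Map<String, ColumnRow> bySchemaAndTable(List<ColumnRow> rows, String schema, String table) {
        return rows.stream()
            .filter(r -> r.schema.equals(schema) && r.table.equals(table))
            .collect(Collectors.toMap(r -> r.column, Function.identity()));
    }

    public static void main(String[] args) {
        try {
            byTableOnly(metadataRows(), "orders");
        } catch (IllegalStateException e) {
            System.out.println("table-only filter failed: " + e.getMessage());
        }
        System.out.println("schema+table filter kept: "
            + bySchemaAndTable(metadataRows(), "a_b", "orders").keySet());
    }
}
```

Comparing the schema with `equals` rather than relying on the LIKE pattern is what removes the leaked rows before they reach the map.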
   
   ### 3. Stabilize flaky cases
   
   - `test_streaming_insert_job_fetch_meta_error`: assert the PAUSED status and the error message inside the poll itself rather than in a separate read afterwards, because the job is auto-resumable and oscillates between states.
   - `test_streaming_postgres_job_special_offset`: use a UNIQUE-key table, wait for the full snapshot before asserting, and explicitly assert that rows written before the ALTER LSN are skipped.
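
The "assert inside the poll" pattern from the first bullet can be sketched as follows. This is an illustration only: the regression suites are not written this way, and `PollAssertSketch`, `JobState`, and `awaitPausedWithError` are hypothetical names. The point is that the status and the error message are checked against the same read, so a job that resumes between two separate reads cannot make the test flake.

```java
import java.util.function.Supplier;

// Hypothetical polling helper; not part of the Doris regression framework.
final class PollAssertSketch {

    /** One observation of the job, taken in a single read. */
    static final class JobState {
        final String status;
        final String errorMsg;

        JobState(String status, String errorMsg) {
            this.status = status;
            this.errorMsg = errorMsg;
        }
    }

    /**
     * Polls until one single read shows PAUSED together with the expected error text.
     * Returns false if no such read is seen within maxAttempts.
     */
    public static boolean awaitPausedWithError(Supplier<JobState> read, String expectedFragment,
                                               int maxAttempts, long sleepMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            JobState s = read.get(); // both checks below use this same snapshot
            if ("PAUSED".equals(s.status)
                    && s.errorMsg != null
                    && s.errorMsg.contains(expectedFragment)) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

A test would pass a supplier that queries the job once per call and fail if the helper returns `false`.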
   
   ### New regression
   
   `test_streaming_postgres_job_snapshot_with_concurrent_dml_multi_table`: multi-table snapshot with concurrent DML on every table, asserting that no row of either table is lost (covers fix 1).
   
   ## Release note
   
   Fix PostgreSQL CDC multi-table snapshot data loss and a binlog duplicate-key 
failure.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]