JNSimba opened a new pull request, #64075: URL: https://github.com/apache/doris/pull/64075
## What problem does this PR solve?

Three related fixes for PostgreSQL CDC streaming jobs (`FROM ... TO ...` mode).

### 1. Multi-table snapshot data loss

For a job with **multiple tables** and `offset=initial`, each snapshot/backfill split rewrites the shared per-job publication to only its own table (`ALTER PUBLICATION ... SET TABLE <one table>`), so during snapshot the publication keeps flipping between single tables. Logical decoding evaluates publication membership at the WAL position of each change, so a row written to a table while the publication temporarily excludes it is filtered out and **permanently lost**, violating at-least-once. Single-table jobs and user-provided publications are not affected.

Fix: Doris creates and owns the publication as the full `include_tables` set up-front in `PostgresSourceReader.initialize`, and publication autocreate is always DISABLED so it is never rewritten per split.

### 2. Binlog "Duplicate key" when a same-named table exists in another schema

`doReadTableColumn` filtered the columns returned by `getColumns` only by TABLE_NAME. Because the schema argument of `getColumns` is a LIKE pattern, a schema whose name matches via wildcard (e.g. `_`) and contains a same-named table leaks its columns, which then collide on column name and throw `IllegalStateException: Duplicate key` on the binlog reader.

Fix: also compare SCHEMA_NAME.

### 3. Stabilize flaky cases

- `test_streaming_insert_job_fetch_meta_error`: assert PAUSED status and error message inside the poll (the job is auto-resumable and oscillates), instead of a separate read afterwards.
- `test_streaming_postgres_job_special_offset`: use a UNIQUE-key table, wait for the full snapshot before asserting, and explicitly assert rows before the ALTER LSN are skipped.

### New regression

`test_streaming_postgres_job_snapshot_with_concurrent_dml_multi_table`: multi-table snapshot with concurrent DML on every table, asserting no row of either table is lost (covers fix #1).
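
### Illustration of fix #2 (sketch, not the Doris code)

The wildcard leak described in fix #2 can be reproduced in a small Python model. This is a minimal sketch and not the actual `doReadTableColumn` implementation: the catalog rows, the schema names `app_v1` / `appXv1`, and the helpers `like_to_regex`, `get_columns`, and `read_table_columns` are all hypothetical, and the `ValueError` only stands in for the `IllegalStateException: Duplicate key` seen on the binlog reader.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern ('_' = any one char, '%' = any run) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

# Hypothetical catalog rows: (schema, table, column).
# "appXv1" matches the LIKE pattern "app_v1" because '_' is a wildcard.
CATALOG = [
    ("app_v1", "orders", "id"),
    ("app_v1", "orders", "amount"),
    ("appXv1", "orders", "id"),
    ("appXv1", "orders", "note"),
]

def get_columns(schema_pattern, table_pattern):
    """Stand-in for a metadata lookup whose arguments are LIKE patterns."""
    schema_re = like_to_regex(schema_pattern)
    table_re = like_to_regex(table_pattern)
    return [row for row in CATALOG
            if schema_re.fullmatch(row[0]) and table_re.fullmatch(row[1])]

def read_table_columns(schema, table, check_schema):
    """Collect columns by name; check_schema=False models the pre-fix filter."""
    columns = {}
    for row_schema, row_table, column in get_columns(schema, table):
        if row_table != table:
            continue  # table-name filter (present before the fix)
        if check_schema and row_schema != schema:
            continue  # schema-name filter (what the fix adds)
        if column in columns:
            raise ValueError(f"Duplicate key {column}")
        columns[column] = (row_schema, row_table)
    return columns
```

With `check_schema=False`, the `id` column of `appXv1.orders` collides with the one from `app_v1.orders`; with `check_schema=True`, only the columns of `app_v1.orders` are kept.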
## Release note

Fix PostgreSQL CDC multi-table snapshot data loss and a binlog duplicate-key failure.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
