Tomás Miguez created FLINK-39622:
------------------------------------
Summary: [postgres] CustomPostgresSchema re-reads JDBC metadata
for every split, causing O(N²) snapshot startup
Key: FLINK-39622
URL: https://issues.apache.org/jira/browse/FLINK-39622
Project: Flink
Issue Type: Bug
Components: Flink CDC
Affects Versions: 1.20.2
Reporter: Tomás Miguez
Description
CustomPostgresSchema#readTableSchema
(`flink-connector-postgres-cdc/.../utils/CustomPostgresSchema.java`) calls
jdbcConnection.readSchema(...) with the full set of captured table IDs, so a
single call already populates Tables for every captured table. However, the
subsequent loop iterates only over the tableIds argument (the subset requested
for the current split) and therefore caches only that subset into
schemasByTableId.
As a result, when the snapshot phase requests schemas one split at a time, each
call re-reads JDBC metadata for all captured tables but throws most of the work
away. With N captured tables this becomes O(N²) JDBC metadata lookups during
snapshot startup, which is very visible on Postgres instances with many
captured tables (pg_catalog query load spikes). This makes Flink CDC
effectively unusable for applications that implement multitenancy through
schema-per-tenant separation, as ours does.
Steps to reproduce
1. Configure the Postgres CDC source to capture a large number of tables (e.g.
200+).
2. Start a fresh job (snapshot phase).
3. Observe snapshot startup latency and pg_catalog query volume scale with N².
Proposed fix
Iterate every table that readSchema discovered (tables.tableIds()) and cache
all of them in schemasByTableId, while adding only the originally requested
subset to the returned tableChanges. Subsequent splits can then be served from
the cache without another full metadata scan.
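To illustrate the intended behavior, here is a minimal self-contained sketch of the caching pattern. The class and method names loosely mirror CustomPostgresSchema#readTableSchema, but the types are simplified stand-ins (plain Strings and Maps), not the real Debezium/Flink CDC classes; the metadataScans counter is added purely to make the cost visible.

```java
import java.util.*;

// Hypothetical sketch: cache every schema discovered by a full metadata
// read, so only the first split pays for the scan.
class SchemaCacheSketch {
    private final Map<String, String> schemasByTableId = new HashMap<>();
    int metadataScans = 0; // counts full "JDBC metadata" reads (illustration only)

    // Simulates jdbcConnection.readSchema(...): one call discovers the
    // schema of EVERY captured table, regardless of which subset was asked for.
    private Map<String, String> readAllSchemas(List<String> capturedTables) {
        metadataScans++;
        Map<String, String> all = new LinkedHashMap<>();
        for (String t : capturedTables) {
            all.put(t, "schema-of-" + t);
        }
        return all;
    }

    // Proposed pattern: on a cache miss, cache ALL discovered schemas,
    // but return only the subset requested for the current split.
    Map<String, String> readTableSchema(List<String> requested, List<String> captured) {
        boolean miss = false;
        for (String t : requested) {
            if (!schemasByTableId.containsKey(t)) {
                miss = true;
                break;
            }
        }
        if (miss) {
            // One full scan populates the cache for every captured table,
            // so later splits never touch metadata again: O(N) total, not O(N^2).
            schemasByTableId.putAll(readAllSchemas(captured));
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String t : requested) {
            result.put(t, schemasByTableId.get(t));
        }
        return result;
    }
}
```

With N captured tables and one table per split, the buggy version performs N full scans, while this pattern performs exactly one.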
A patch is available — happy to open a PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)