morozov commented on code in PR #3929: URL: https://github.com/apache/flink-cdc/pull/3929#discussion_r1993805772
########## flink-cdc-connect/flink-cdc-source-connectors/flink-connector-postgres-cdc/src/main/java/org/apache/flink/cdc/connectors/postgres/source/utils/PostgresQueryUtils.java: ########## @@ -69,12 +69,17 @@ public static long queryApproximateRowCnt(JdbcConnection jdbc, TableId tableId) // https://stackoverflow.com/questions/7943233/fast-way-to-discover-the-row-count-of-a-table-in-postgresql // NOTE: it requires ANALYZE or VACUUM to be run first in PostgreSQL. final String query = - String.format( - "SELECT reltuples::bigint FROM pg_class WHERE oid = to_regclass('%s')", - tableId.toString()); - - return jdbc.queryAndMap( + "SELECT reltuples::bigint" + + " FROM pg_class c" + + " JOIN pg_namespace n ON n.oid = c.relnamespace" Review Comment: This is part of the bug fix. Let's assume we have a table named `Customers` in the schema named `public`. The previous logic would work like this: 1. Use `TableId#toString()` to build the qualified name from the schema name and table name: `public.Customers`. 2. Use `to_regclass()` to parse the qualified name and find the table OID. 3. Fetch the number of rows in the table by OID. The problem is that the value passed to `to_regclass()` is parsed as a Postgres SQL expression but isn't formatted as such. It doesn't account for case insensitivity of unquoted identifiers (see the [documentation](https://www.postgresql.org/docs/current/sql-syntax-lexical.html)): > Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. For the original query to work, it should have looked like: ```sql SELECT reltuples::bigint FROM pg_class WHERE oid = to_regclass('public."Customers"') ``` So it's effectively SQL in SQL. The new query is much more straightforward and isn't prone to this issue. It just selects the number of rows for the table whose schema name is `public` and name is `Customers` by avoiding the "SQL in SQL" situation. It is possible to make the old query work, but it would involve building SQL by using `quote()` and then parsing via `to_regclass()` which is redundant (the first two steps in the original query effectively eliminate each other). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org