On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote: > > Dear hackers, > > > Since it may be useful, I will post top-up patch on Monday, if there are no > > updating. > > And here are top-up patches. Feel free to check and include. > > v22-0001: Same as v21-0001. > === rebased patches === > v22-0002: Update docs per recent changes. Same as v20-0002. > v22-0003: Add check versions of the target. Extracted from v20-0003. > v22-0004: Remove -S option. Mostly same as v20-0009, but commit massage was > slightly changed. > === Newbie === > V22-0005: Addressed my comments which seems to be trivial[1]. > Comments #1, 3, 4, 8, 10, 14, 17 were addressed here. > v22-0006: Consider the scenario when commands are failed after the recovery. > drop_subscription() is removed and some messages are added per [2]. > V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were > addressed here. > V22-0008: Fix a strange report when physical_primary_slot is null. Per > comment #9 [1]. > V22-0009: Prohibit reuse publications when it has already existed. Per > comments #11 and 12 [1]. > V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits > soon. Per comment #15 [1]. > V22-0011: Update testcode. Per comments #17- [1]. > > I did not handle below points because I have unclear points. > > a. > This patch set cannot detect the disconnection between the target (standby) > and > source (primary) during the catch up. Because the connection status must be > gotten > at the same time (=in the same query) with the recovery status, but now it is > now an > independed function (server_is_in_recovery()). > > b. > This patch set cannot detect the inconsistency reported by Shubham [3]. I > could not > come up with solutions without removing -P... >
Few comments for v22-0001 patch: 1) The second "if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)"" should be if (strcmp(PQgetvalue(res, 0, 2), "t") != 0): + if (strcmp(PQgetvalue(res, 0, 1), "t") != 0) + { + pg_log_error("permission denied for database %s", dbinfo[0].dbname); + return false; + } + if (strcmp(PQgetvalue(res, 0, 1), "t") != 0) + { + pg_log_error("permission denied for function \"%s\"", + "pg_catalog.pg_replication_origin_advance(text, pg_lsn)"); + return false; + } 2) pg_createsubscriber fails if a table is parallely created in the primary node: 2024-02-20 14:38:49.005 IST [277261] LOG: database system is ready to accept connections 2024-02-20 14:38:54.346 IST [277270] ERROR: relation "public.tbl5" does not exist 2024-02-20 14:38:54.346 IST [277270] STATEMENT: CREATE SUBSCRIPTION pg_createsubscriber_5_277236 CONNECTION ' dbname=postgres' PUBLICATION pg_createsubscriber_5 WITH (create_slot = false, copy_data = false, enabled = false) If we are not planning to fix this, at least it should be documented 3) Error conditions is verbose mode has invalid error message like "out of memory" messages like in below: pg_createsubscriber: waiting the postmaster to reach the consistent state pg_createsubscriber: postmaster reached the consistent state pg_createsubscriber: dropping publication "pg_createsubscriber_5" on database "postgres" pg_createsubscriber: creating subscription "pg_createsubscriber_5_278343" on database "postgres" pg_createsubscriber: error: could not create subscription "pg_createsubscriber_5_278343" on database "postgres": out of memory 4) In error cases we try to drop this publication again resulting in error: + /* + * Since the publication was created before the consistent LSN, it is + * available on the subscriber when the physical replica is promoted. + * Remove publications from the subscriber because it has no use. + */ + drop_publication(conn, &dbinfo[i]); Which throws these errors(because of drop publication multiple times): pg_createsubscriber: dropping publication "pg_createsubscriber_5" on database "postgres" pg_createsubscriber: error: could not drop publication "pg_createsubscriber_5" on database "postgres": ERROR: publication "pg_createsubscriber_5" does not exist pg_createsubscriber: dropping publication "pg_createsubscriber_5" on database "postgres" pg_createsubscriber: dropping the replication slot "pg_createsubscriber_5_278343" on database "postgres" 5) In error cases, wait_for_end_recovery waits even though it has identified that the replication between primary and standby is stopped: +/* + * Is recovery still in progress? + * If the answer is yes, it returns 1, otherwise, returns 0. If an error occurs + * while executing the query, it returns -1. + */ +static int +server_is_in_recovery(PGconn *conn) +{ + PGresult *res; + int ret; + + res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()"); + + if (PQresultStatus(res) != PGRES_TUPLES_OK) + { + PQclear(res); + pg_log_error("could not obtain recovery progress"); + return -1; + } + You can simulate this by stopping the primary just before wait_for_end_recovery and you will see these error messages, but pg_createsubscriber will continue to wait: pg_createsubscriber: error: could not obtain recovery progress pg_createsubscriber: error: could not obtain recovery progress pg_createsubscriber: error: could not obtain recovery progress pg_createsubscriber: error: could not obtain recovery progress ... Regards, Vignesh