It seems my previous message isn't visible in the thread. I can see Robert's reply, but not the original message he was responding to. I'm resending the message and all attachments to ensure it's saved in the thread.
On Thu, 13 Nov 2025 at 19:45, Alena Vinter <[email protected]> wrote: > Hi everyone! > > Robert, here's a realistic scenario where the issue occurs: > 1. Start with a primary and physical standby > 2. Convert the standby to a logical replica using `pg_createsubscriber` > 3. Later, create a new physical standby from a backup of this logical > replica > 4. The new standby fails to start because it cannot reach consistency point > > The root cause is that `pg_createsubscriber` leaves behind recovery > parameters that interfere with the new standby's startup process, causing > recovery to stop before reaching a consistency point. > > To demonstrate this, I've expanded the existing ' > 040_pg_createsubscriber.pl' test to include this scenario. I've also > attached a standalone TAP test below for easier verification of the > specific failure case. > > --- > Regards, > Alena Vinter >
041_node_from_backup_after_pg_createsubscriber.pl
Description: Perl program
From 4240b48b46e5b8890f62d6461417bef91eac6bad Mon Sep 17 00:00:00 2001 From: Alena Vinter <[email protected]> Date: Mon, 27 Oct 2025 13:02:51 +0700 Subject: [PATCH 2/2] doc: Add warning about leftover recovery parameters in 'pg_createsubscriber' --- doc/src/sgml/ref/pg_createsubscriber.sgml | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml b/doc/src/sgml/ref/pg_createsubscriber.sgml index bb9cc72576c..fa14aeeef13 100644 --- a/doc/src/sgml/ref/pg_createsubscriber.sgml +++ b/doc/src/sgml/ref/pg_createsubscriber.sgml @@ -520,11 +520,19 @@ PostgreSQL documentation are added to avoid unexpected behavior during the recovery process such as end of the recovery as soon as a consistent state is reached (WAL should be applied until the replication start location) and multiple - recovery targets that can cause a failure. This step finishes once the - server ends standby mode and is accepting read-write transactions. If - <option>--recovery-timeout</option> option is set, - <application>pg_createsubscriber</application> terminates if recovery - does not end until the given number of seconds. + recovery targets that can cause a failure. + </para> + <para> + If an error occurs during this step, recovery parameters might remain in + the target server's configuration. You may need to manually remove these + leftover parameters before attempting to run. + <application>pg_createsubscriber</application> again. + </para> + <para> + This step finishes once the server ends standby mode and is accepting + read-write transactions. If <option>--recovery-timeout</option> option + is set, <application>pg_createsubscriber</application> terminates if + recovery does not end until the given number of seconds. </para> </step> -- 2.51.1
From 44f58736b59064aaf2d7c27910165e732135700d Mon Sep 17 00:00:00 2001 From: Alena Vinter <[email protected]> Date: Mon, 27 Oct 2025 13:02:02 +0700 Subject: [PATCH 1/2] Reseting recovery target parameters in `pg_createsubscriber` The utility sets recovery target params for correct recovery before conversion a physical replica to a logical one but does not reset them afterward. It may cause recovery failures in certain scenarios. For example, if recovery begins from a checkpoint where no WAL records need to be applied, the system may incorrectly determine that the recovery target was never reached because these parameters remain active. This change ensures all recovery parameters are properly reset after conversion to prevent such edge cases. --- src/bin/pg_basebackup/pg_createsubscriber.c | 100 +++++++++++++++--- .../t/040_pg_createsubscriber.pl | 64 +++++++++++ 2 files changed, 148 insertions(+), 16 deletions(-) diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c index df41836e70f..012660e718e 100644 --- a/src/bin/pg_basebackup/pg_createsubscriber.c +++ b/src/bin/pg_basebackup/pg_createsubscriber.c @@ -29,9 +29,12 @@ #include "fe_utils/string_utils.h" #include "fe_utils/version.h" #include "getopt_long.h" +#include "storage/fd.h" +#include "utils/elog.h" #define DEFAULT_SUB_PORT "50432" #define OBJECTTYPE_PUBLICATIONS 0x0001 +#define RECOVERY_CONF_PREFIX ".original" /* Command-line options */ struct CreateSubscriberOptions @@ -155,22 +158,67 @@ static char *subscriber_dir = NULL; static bool recovery_ended = false; static bool standby_running = false; +static bool recovery_params_set = false; + +static const char *recovery_conf_filename = NULL; /* cached recovery conf + * filename */ +static off_t recovery_conf_size = 0; /* size of original recovery conf file */ /* - * Cleanup objects that were created by pg_createsubscriber if there is an - * error. + * Clean up objects created by pg_createsubscriber. * - * Publications and replication slots are created on primary. Depending on the - * step it failed, it should remove the already created objects if it is - * possible (sometimes it won't work due to a connection issue). - * There is no cleanup on the target server. The steps on the target server are - * executed *after* promotion, hence, at this point, a failure means recreate - * the physical replica and start again. + * Publications and replication slots are created on the primary. Depending on + * the step where it failed, already created objects should be removed if + * possible (sometimes this won't work due to a connection issue). + * There is no cleanup on the target server *after* its promotion, because any + * failure at this point means recreating the physical replica and starting + * again. */ static void cleanup_objects_atexit(void) { + /* + * Restore the recovery config to its original contents, ensuring that the + * subscriber doesn't use outdated recovery parameters. + */ + if (recovery_params_set) + { + int err = 0; + char filename[MAXPGPATH]; + + snprintf(filename, MAXPGPATH, "%s/%s", subscriber_dir, recovery_conf_filename); + + /* + * 'recovery.conf' are fully rewritten during the 'recovery_setup' + * function, so we need to restore saved original contents + */ + if (strcmp(recovery_conf_filename, "recovery.conf") == 0) + { + char filename_with_original_contents[MAXPGPATH]; + + snprintf(filename_with_original_contents, MAXPGPATH, + "%s" RECOVERY_CONF_PREFIX, filename); + + err = durable_rename(filename_with_original_contents, filename, ERROR); + } + else /* postgresql.auto.conf */ + { + if ((err = truncate(filename, recovery_conf_size)) != 0) + { + pg_log_error("could not truncate file \"%s\" to %u: %m", + filename, (unsigned int) recovery_conf_size); + } + } + + if (err != 0) + { + pg_log_warning("recovery parameters were set but not properly reset on the target"); + pg_log_warning_hint("manual removal of recovery parameters is required from '%s'", + recovery_conf_filename); + } + } + if (success) return; @@ -1026,6 +1074,9 @@ check_subscriber(const struct LogicalRepInfo *dbinfo) pg_log_info("checking settings on subscriber"); conn = connect_database(dbinfo[0].subconninfo, true); + recovery_conf_filename = + PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC ? + "recovery.conf" : "postgresql.auto.conf"; /* The target server must be a standby */ if (!server_is_in_recovery(conn)) @@ -1238,6 +1289,8 @@ setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir, const c { PGconn *conn; PQExpBuffer recoveryconfcontents; + char filename[MAXPGPATH]; + struct stat st; /* * Despite of the recovery parameters will be written to the subscriber, @@ -1246,6 +1299,16 @@ setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir, const c */ conn = connect_database(dbinfo[0].pubconninfo, true); + /* + * Before setting up the recovery parameters save the original size for + * restoring later. + */ + snprintf(filename, MAXPGPATH, "%s/%s", datadir, recovery_conf_filename); + if (stat(filename, &st) < 0) + pg_fatal("could not stat file \"%s\": %m", filename); + + recovery_conf_size = st.st_size; + /* * Write recovery parameters. * @@ -1277,17 +1340,22 @@ setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir, const c appendPQExpBufferStr(recoveryconfcontents, "recovery_target_time = ''\n"); appendPQExpBufferStr(recoveryconfcontents, "recovery_target_xid = ''\n"); - if (dry_run) - { - appendPQExpBufferStr(recoveryconfcontents, "# dry run mode\n"); - appendPQExpBuffer(recoveryconfcontents, - "recovery_target_lsn = '%X/%08X'\n", - LSN_FORMAT_ARGS((XLogRecPtr) InvalidXLogRecPtr)); - } - else + if (!dry_run) { + if (strcmp(recovery_conf_filename, "recovery.conf") == 0) + { + char filename_with_original_contents[MAXPGPATH]; + + snprintf(filename_with_original_contents, MAXPGPATH, + "%s" RECOVERY_CONF_PREFIX, filename); + + durable_rename(filename, filename_with_original_contents, FATAL); + } + appendPQExpBuffer(recoveryconfcontents, "recovery_target_lsn = '%s'\n", lsn); + + recovery_params_set = true; WriteRecoveryConfig(conn, datadir, recoveryconfcontents); } disconnect_database(conn, false); diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl index 3d6086dc489..c25d5c9822c 100644 --- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl +++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl @@ -41,6 +41,30 @@ sub generate_db return $dbname; } +# Check if a parameter is absent from the server's configuration file. +# Uses 'recovery.conf' for PostgreSQL < 12, 'postgresql.auto.conf' for newer +# versions. +sub test_param_absent +{ + my ($node, $param) = @_; + my $version = $node->pg_version->major; + my $conf; + + if ($version < 12) + { + $conf = $node->data_dir . '/recovery.conf'; + } + else + { + $conf = $node->data_dir . '/postgresql.auto.conf'; + } + + return 1 unless -e $conf; + + my $content = slurp_file($conf); + return $content !~ /^\s*$param\s*=/m; +} + # # Test mandatory options command_fails(['pg_createsubscriber'], @@ -160,6 +184,7 @@ primary_slot_name = '$slotname' primary_conninfo = '$pconnstr dbname=postgres' hot_standby_feedback = on ]); +my $sconnstr = $node_s->connstr; $node_s->set_standby_mode(); $node_s->start; @@ -467,6 +492,13 @@ command_ok( ], 'run pg_createsubscriber on node S'); +# Verify that recovery parameters have been reset after pg_createsubscriber +# We check recovery_target_lsn as a representative parameter - since all +# recovery parameters are managed as a group, the absence of one indicates +# that the entire set has been properly cleared from the configuration. +ok( test_param_absent($node_s, 'recovery_target_lsn'), + 'recovery_target_lsn parameter was removed'); + # Confirm the physical replication slot has been removed $result = $node_p->safe_psql($db1, "SELECT count(*) FROM pg_replication_slots WHERE slot_name = '$slotname'" @@ -537,10 +569,42 @@ my $sysid_s = $node_s->safe_psql('postgres', 'SELECT system_identifier FROM pg_control_system()'); isnt($sysid_p, $sysid_s, 'system identifier was changed'); +# Create a physical standby from the promoted subscriber +$node_s->safe_psql('postgres', + "SELECT pg_create_physical_replication_slot('$slotname');"); + +# Create backup from promoted subscriber +$node_s->backup('backup_3'); + +# Initialize new physical standby +my $node_k = PostgreSQL::Test::Cluster->new('node_k'); +$node_k->init_from_backup($node_s, 'backup_3', + has_streaming => 1); + +# Configure the new standby +$node_k->append_conf('postgresql.conf', qq[ +primary_slot_name = '$slotname' +primary_conninfo = '$sconnstr dbname=postgres' +hot_standby_feedback = on +]); + +$node_k->set_standby_mode(); +my $node_k_name = $node_s->name; +command_ok( + [ + 'pg_ctl', '--wait', + '--pgdata' => $node_k->data_dir, + '--log' => $node_k->logfile, + '--options' => "--cluster-name=$node_k_name", + 'start' + ], + "node_k has started"); + # clean up $node_p->teardown_node; $node_s->teardown_node; $node_t->teardown_node; $node_f->teardown_node; +$node_k->teardown_node; done_testing(); -- 2.51.1
