On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
> ## Analysis for failure 1
> 
> The failure is caused by a time lag between when the walreceiver finishes and
> when pg_is_in_recovery() starts returning false.
> 
> According to the output [1], it seems that the tool failed at
> wait_for_end_recovery() with the message "standby server disconnected from
> the primary". Also, the lines "redo done at..." and "terminating walreceiver
> process due to administrator command" mean that the walreceiver was requested
> to shut down by XLogShutdownWalRcv().
> 
> Looking at the source, we can confirm that the walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which means that
> pg_is_in_recovery() returns false) in the latter part of StartupXLOG().
> 
> So, if there is a delay between FinishWalRecovery() and the state change, the
> check in wait_for_end_recovery() can fail during that window. Since we allow
> the walreceiver to be missing up to 10 times, checked once per second, the
> failure occurs if the time lag is longer than 10 seconds.
> 
> I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
> larger, but that is not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().
As you said, there are a lot of steps between FinishWalRecovery() (where the
walreceiver is shut down -- XLogShutdownWalRcv()) and the point where
SharedRecoveryState is set to RECOVERY_STATE_DONE. If this window takes longer
than NUM_CONN_ATTEMPTS * WAIT_INTERVAL (10 seconds), the tool aborts. That's a
bad decision because the promotion has already finished and the server is just
doing the final preparation to become a primary.

        /*
         * If it is still in recovery, make sure the target server is
         * connected to the primary so it can receive the required WAL to
         * finish the recovery process. If it is disconnected try
         * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
         */
        res = PQexec(conn,
                     "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
        if (PQntuples(res) == 0)
        {
            if (++count > NUM_CONN_ATTEMPTS)
            {
                stop_standby_server(subscriber_dir);
                pg_log_error("standby server disconnected from the primary");
                break;
            }
        }
        else
            count = 0;          /* reset counter if it connects again */
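
For context, wait_for_end_recovery() is essentially the following poll loop
(a simplified sketch, assuming the one-second WAIT_INTERVAL mentioned above;
recovery-timeout and error handling elided):

    for (;;)
    {
        PGresult   *res;

        /* Promotion is done once pg_is_in_recovery() returns false. */
        res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");
        if (strcmp(PQgetvalue(res, 0, 0), "f") == 0)
        {
            PQclear(res);
            break;
        }
        PQclear(res);

        /* ... the pg_stat_wal_receiver check quoted above runs here ... */

        pg_usleep(WAIT_INTERVAL * 1000000L);    /* WAIT_INTERVAL is in seconds */
    }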

The pg_stat_wal_receiver check was added to defend against the death/crash of
the target server. There are at least 3 options:

(1) increase the NUM_CONN_ATTEMPTS * WAIT_INTERVAL window. We discussed this
constant and I decided to use 10 seconds because, even on some slow hosts, this
limit wasn't reached during my tests. It seems I forgot to test the combination
of a slow host, asserts enabled, and UBSan. I also didn't notice that
pg_promote() uses 60 seconds as its default wait. Maybe that's a reasonable
value. I checked the 004_timeline_switch test and the last run took: 39.2s
(serinus), 33.1s (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check whether the primary is running when the walreceiver is not available
on the target server, and increment the connection attempts only if the primary
is not running. Hence, the described case doesn't increment the count variable
(see the sketch below).

(3) set the recovery_timeout default to a nonzero value and remove the
pg_stat_wal_receiver check that protects against death/crash of the target
server. I explained in a previous message that a timeout may occur in cases
where the WAL replay required to reach a consistent state takes longer than
recovery-timeout seconds.

Option (1) is the easiest fix; however, we can have the same issue again if a
slow host decides to be even slower, and then we would have to adjust this
value again. Option (2) interprets the absence of the walreceiver while the
primary is still running as a sign that the target server is on the verge of
finishing recovery. Option (3) is not as resilient as the other options.

The first patch implements a combination of (1) and (2).
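
In essence, the failure branch of the loop becomes this (a sketch; the first
patch below has the real change):

    if (PQntuples(res) == 0)
    {
        /*
         * No walreceiver. Count this as a failed attempt only if the
         * primary itself is unreachable; if PQping() succeeds, the standby
         * is most likely finishing its promotion.
         */
        if (PQping(dbinfo->pubconninfo) != PQPING_OK)
            count++;
        else
            count = 0;

        if (count > NUM_ATTEMPTS)       /* now 60 instead of 10 */
        {
            stop_standby_server(subscriber_dir);
            pg_log_error("standby server disconnected from the primary");
            break;
        }
    }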

> ## Analysis for failure 2
> 
> According to [2], the physical replication slot specified as
> primary_slot_name was not being used by a walsender process. At that time, no
> walsender existed yet.
> 
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM 
> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: 
> got 0 rows, expected 1 row
> ...
> ```
> 
> Currently the standby must be stopped before running the command, and the
> current code does not wait for replication to start. So there is a
> possibility that the check runs before the walsender is launched.
> 
> One possible approach is to wait until replication starts. An alternative is
> to relax the condition.

That's my suggestion too. I reused NUM_CONN_ATTEMPTS (which was renamed to
NUM_ATTEMPTS in the first patch). See the second patch.
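
The retry is the usual bounded poll; a sketch of the shape (the one-second
pause between attempts is an assumption here, to keep the total wait around
NUM_ATTEMPTS seconds instead of a busy loop):

    int         ntuples;
    int         count = 0;

    do
    {
        res = PQexec(conn, str->data);
        /* error handling elided */
        ntuples = PQntuples(res);
        PQclear(res);

        if (ntuples == 1)
            break;              /* replication slot is active */

        pg_usleep(1000000L);    /* one second between attempts (assumption) */
    } while (++count < NUM_ATTEMPTS);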


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From b2cfa13627bd844b28fe050ffb35e8b9696fdf2e Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.tave...@enterprisedb.com>
Date: Mon, 25 Mar 2024 22:01:52 -0300
Subject: [PATCH v1 1/2] Improve the code that checks if the recovery is
 finishing

The recovery process has a window between the walreceiver shutdown and
the moment pg_is_in_recovery() starts returning false. It means that the
pg_stat_wal_receiver check can fail while the server is still finishing
the recovery (even though it has already reached the recovery target).
Since pg_stat_wal_receiver is checked to verify that the primary is
available, if it does not return a row, PQping the primary server
instead. If the primary is up and running, it can indicate that the
target server is finishing the recovery process, hence we shouldn't
count it as an attempt. It avoids premature failures on slow hosts.

While on it, increase the number of attempts (10 to 60). The total wait
time now matches what the pg_promote function uses by default.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 30 +++++++++++++--------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index b8f8269340..cca93d8c25 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -30,6 +30,8 @@
 
 #define	DEFAULT_SUB_PORT	"50432"
 
+#define NUM_ATTEMPTS		60
+
 /* Command-line options */
 struct CreateSubscriberOptions
 {
@@ -93,7 +95,7 @@ static void pg_ctl_status(const char *pg_ctl_cmd, int rc);
 static void start_standby_server(const struct CreateSubscriberOptions *opt,
 								 bool restricted_access);
 static void stop_standby_server(const char *datadir);
-static void wait_for_end_recovery(const char *conninfo,
+static void wait_for_end_recovery(const struct LogicalRepInfo *dbinfo,
 								  const struct CreateSubscriberOptions *opt);
 static void create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
@@ -1354,18 +1356,16 @@ stop_standby_server(const char *datadir)
  * the recovery process. By default, it waits forever.
  */
 static void
-wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt)
+wait_for_end_recovery(const struct LogicalRepInfo *dbinfo, const struct CreateSubscriberOptions *opt)
 {
 	PGconn	   *conn;
 	int			status = POSTMASTER_STILL_STARTING;
 	int			timer = 0;
 	int			count = 0;		/* number of consecutive connection attempts */
 
-#define NUM_CONN_ATTEMPTS	10
-
 	pg_log_info("waiting for the target server to reach the consistent state");
 
-	conn = connect_database(conninfo, true);
+	conn = connect_database(dbinfo->subconninfo, true);
 
 	for (;;)
 	{
@@ -1384,16 +1384,24 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions
 		}
 
 		/*
-		 * If it is still in recovery, make sure the target server is
-		 * connected to the primary so it can receive the required WAL to
-		 * finish the recovery process. If it is disconnected try
-		 * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
+		 * If it is still in recovery, make sure the target server is connected
+		 * to the primary so it can receive the required WAL to finish the
+		 * recovery process. If the walreceiver process is not running, it
+		 * should indicate that (i) the recovery is almost finished or (ii)
+		 * the primary is not running or is not accepting connections. It
+		 * should only count as an attempt if (ii) is true. In this case, try
+		 * NUM_ATTEMPTS times in a row and bail out if it does not succeed.
 		 */
 		res = PQexec(conn,
 					 "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
 		if (PQntuples(res) == 0)
 		{
-			if (++count > NUM_CONN_ATTEMPTS)
+			if (PQping(dbinfo->pubconninfo) != PQPING_OK)
+				count++;
+			else
+				count = 0;		/* reset counter if it connects again */
+
+			if (count > NUM_ATTEMPTS)
 			{
 				stop_standby_server(subscriber_dir);
 				pg_log_error("standby server disconnected from the primary");
@@ -2113,7 +2121,7 @@ main(int argc, char **argv)
 	start_standby_server(&opt, true);
 
 	/* Waiting the subscriber to be promoted */
-	wait_for_end_recovery(dbinfo[0].subconninfo, &opt);
+	wait_for_end_recovery(&dbinfo[0], &opt);
 
 	/*
 	 * Create the subscription for each database on subscriber. It does not
-- 
2.30.2

From 872510ed67b91cb7482dea3bce831549d3519e80 Mon Sep 17 00:00:00 2001
From: Euler Taveira <euler.tave...@enterprisedb.com>
Date: Mon, 25 Mar 2024 23:25:30 -0300
Subject: [PATCH v1 2/2] Improve the code that checks if the primary slot is
 available

The target server is started a few instructions before the primary
server is checked, so the replication slot might not be active yet.
Instead of failing on the first attempt, try a few times.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 35 +++++++++++++++------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index cca93d8c25..f17e9bde9c 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -887,6 +887,8 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
 	{
 		PQExpBuffer str = createPQExpBuffer();
 		char	   *psn_esc = PQescapeLiteral(conn, primary_slot_name, strlen(primary_slot_name));
+		int			ntuples;
+		int			count = 0;
 
 		appendPQExpBuffer(str,
 						  "SELECT 1 FROM pg_catalog.pg_replication_slots "
@@ -897,25 +899,38 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
 
 		pg_log_debug("command is: %s", str->data);
 
-		res = PQexec(conn, str->data);
-		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		/*
+		 * The replication slot might take some time to become active; try a
+		 * few times if necessary.
+		 */
+		do
 		{
-			pg_log_error("could not obtain replication slot information: %s",
-						 PQresultErrorMessage(res));
-			disconnect_database(conn, true);
-		}
+			res = PQexec(conn, str->data);
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+			{
+				pg_log_error("could not obtain replication slot information: %s",
+							 PQresultErrorMessage(res));
+				disconnect_database(conn, true);
+			}
 
-		if (PQntuples(res) != 1)
+			ntuples = PQntuples(res);
+			PQclear(res);
+
+			if (ntuples == 1)	/* replication slot is already active */
+				break;
+			else
+				count++;
+		} while (count < NUM_ATTEMPTS);
+
+		if (ntuples != 1)
 		{
 			pg_log_error("could not obtain replication slot information: got %d rows, expected %d row",
-						 PQntuples(res), 1);
+						 ntuples, 1);
 			disconnect_database(conn, true);
 		}
 		else
 			pg_log_info("primary has replication slot \"%s\"",
 						primary_slot_name);
-
-		PQclear(res);
 	}
 
 	disconnect_database(conn, false);
-- 
2.30.2
