On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> I've explained the problem with the current HA setup with synchronous
> replication upthread at [1]. Let me reiterate it here once again.
>
> When a query is cancelled (a simple stroke of CTRL+C or a
> pg_cancel_backend() call) while the txn is waiting for an ack in
> SyncRepWaitForLSN(), then for the client the txn appears committed
> like any other txn, even though it is only locally committed and not
> yet replicated to all of the sync standbys. A warning is emitted
> into the server logs, but it is of no use to the client (think of
> the client as an application). There can be many such txns waiting
> for an ack in SyncRepWaitForLSN(), and a query cancel can be issued
> on all of those sessions. The problem is that subsequent reads will
> then be able to see the data of all of those
> locally-committed-but-not-yet-replicated txns - this is what I call
> an inconsistency (can we call it a read-after-write inconsistency?)
> caused by the lack of proper query cancel handling. And if the sync
> standbys are down or unable to come up for some reason, the primary
> will keep serving clients this inconsistent data until they recover.
> BTW, I found a report of this problem here [2].
>
> The solution proposed for the above problem is to just 'not honor
> query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
>
> When a proc die is pending, there can likewise be
> locally-committed-but-not-yet-replicated txns. Typically, the
> clients then have two choices: 1) reuse the primary instance after a
> restart, or 2) fail over to one of the sync standbys. In case (1),
> there might be a read-after-write inconsistency as explained above.
> In case (2), those txns might get lost completely if the failover
> target sync standby (the new primary) didn't receive them, and the
> other sync standbys that did receive them are now ahead and need
> special treatment (running pg_rewind) before they can connect to the
> new primary.
>
> The solution proposed for case (1) of the above problem is to
> 'process the ProcDiePending immediately, and upon restart have the
> first backend wait until the sync standbys are caught up to ensure
> no inconsistent reads'.
> The solution proposed for case (2) of the above problem is to 'either
> run pg_rewind for the sync standbys that are ahead or use the idea
> proposed at [3]'.
>
> I hope the above explanation helps.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> [2] 
> https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> [3] 
> https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com
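To make the failure mode described above concrete, here is a minimal standalone C model; this is not PostgreSQL code, and `TxnOutcome`, `outcome_on_cancel()`, and both parameters are invented for illustration. It captures what happens to a transaction whose backend is cancelled while waiting in SyncRepWaitForLSN(), with and without the proposed GUC:

```c
/* Hypothetical, simplified model -- not PostgreSQL code.
 * naptime_ms models synchronous_replication_naptime_before_cancel;
 * ack_at_ms models when the standby ack would arrive (-1 = never). */
typedef enum TxnOutcome
{
	TXN_REPLICATED,				/* ack received; visible everywhere */
	TXN_LOCALLY_COMMITTED_ONLY	/* wait canceled; standbys may lag */
} TxnOutcome;

TxnOutcome
outcome_on_cancel(int naptime_ms, int ack_at_ms)
{
	/* GUC disabled (0 or -1): today's behavior -- the cancel is honored
	 * immediately and the txn stays locally committed only */
	if (naptime_ms <= 0)
		return TXN_LOCALLY_COMMITTED_ONLY;

	/* GUC set: the backend naps up to naptime_ms, polling for the ack;
	 * if the ack arrives within the nap, the txn is fully replicated */
	if (ack_at_ms >= 0 && ack_at_ms <= naptime_ms)
		return TXN_REPLICATED;

	/* nap expired without an ack: the cancel proceeds as before */
	return TXN_LOCALLY_COMMITTED_ONLY;
}
```

The read-after-write inconsistency arises exactly in the `TXN_LOCALLY_COMMITTED_ONLY` cases: subsequent reads on the primary see data no sync standby has acknowledged.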

I'm attaching the v2 patch rebased on the latest HEAD. Please note
that there are still some open points; I haven't yet found time to
think more about them. Meanwhile, I'm posting the v2 patch to keep
cfbot happy. Any further thoughts on the overall design of the patch
are most welcome. Thanks.
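For reviewers skimming the patch, the nap in SyncRepNapBeforeCancel() boils down to polling in 1 ms steps and bailing out early once the ack is seen. A rough standalone sketch of that loop follows; `nap_ms_waited` and `ack_at_ms` are illustrative stand-ins for the real pg_usleep()/`MyProc->syncRepState` machinery, not actual APIs:

```c
/* Illustrative model of the 1 ms polling loop in SyncRepNapBeforeCancel().
 * naptime_ms stands in for synchronous_replication_naptime_before_cancel;
 * ack_at_ms (-1 = never) stands in for the moment syncRepState becomes
 * SYNC_REP_WAIT_COMPLETE. Returns the milliseconds spent napping. */
int
nap_ms_waited(int naptime_ms, int ack_at_ms)
{
	int			elapsed = 0;

	while (elapsed < naptime_ms)
	{
		/* the real code sleeps here: pg_usleep(1 * 1000L) */
		elapsed++;

		/* check for the ack once per millisecond, as the patch does */
		if (ack_at_ms >= 0 && elapsed >= ack_at_ms)
			return elapsed;		/* ack arrived; stop napping early */
	}

	return elapsed;				/* full nap elapsed; cancel proceeds */
}
```

One design point this makes visible: polling every millisecond wakes the backend up to naptime times per cancel, whereas waiting on the process latch with a timeout would achieve the same early exit without the busy polling; that may be worth discussing.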

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From be6734c83d72333cc4ec1b2be1f4d54b875b74c8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 27 Sep 2022 12:55:34 +0000
Subject: [PATCH v2] Wait specified amount of time before cancelling sync
 replication

In a PostgreSQL high availability setup with synchronous replication,
transactions typically commit locally first, are then streamed to the
synchronous standbys, and the backend that generated the transaction
waits for acknowledgement from the synchronous standbys. While it is
waiting for the acknowledgement, the query or the transaction may get
canceled, or the waiting backend may be asked to exit. In either of
these cases, the wait for the acknowledgement is canceled, leaving
the transaction in an inconsistent state: committed locally but not
on the standbys. When the GUC
synchronous_replication_naptime_before_cancel introduced in this
patch is set, the backends wait that much longer for the
acknowledgement before canceling the wait, so that the transaction
can become available on the synchronous standbys as well.
---
 doc/src/sgml/config.sgml                      | 30 +++++++++++
 src/backend/replication/syncrep.c             | 50 +++++++++++++++++++
 src/backend/utils/misc/guc_tables.c           | 12 +++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/syncrep.h             |  3 ++
 5 files changed, 97 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d8848bc774..baeef49012 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4524,6 +4524,36 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-synchronous-replication-naptime-before-cancel" xreflabel="synchronous_replication_naptime_before_cancel">
+      <term><varname>synchronous_replication_naptime_before_cancel</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>synchronous_replication_naptime_before_cancel</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the amount of time in milliseconds to wait for
+        synchronous replication before canceling the wait. The default is
+        0; a value of 0 or -1 disables the feature. In a
+        <productname>PostgreSQL</productname> high availability setup with
+        synchronous replication, transactions typically commit locally
+        first, are then streamed to the synchronous standbys, and the
+        backend that generated the transaction waits for acknowledgement
+        from the synchronous standbys. While waiting, the query or the
+        transaction may get canceled, or the waiting backend may be asked
+        to exit, in which case the wait for the acknowledgement is
+        canceled, leaving the transaction in an inconsistent state:
+        committed locally but not on the standbys. When
+        <varname>synchronous_replication_naptime_before_cancel</varname>
+        is set, the backends wait that much longer for the acknowledgement
+        before canceling the wait, so that the transaction can become
+        available on the synchronous standbys as well. This parameter can
+        only be set in the <filename>postgresql.conf</filename> file or on
+        the server command line.
+        </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index e360d925b0..60b6a5e471 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -89,6 +89,7 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+int			SyncRepNapTimeBeforeCancel = 0;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -120,6 +121,7 @@ static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 static int	SyncRepGetStandbyPriority(void);
 static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
+static bool SyncRepNapBeforeCancel(void);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -131,6 +133,42 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Wait for synchronous replication before canceling, if requested by user.
+ */
+static bool
+SyncRepNapBeforeCancel(void)
+{
+	int wait_time;
+
+	if (SyncRepNapTimeBeforeCancel <= 0)
+		return false;
+
+	ereport(WARNING,
+			(errmsg_plural("waiting %d millisecond for synchronous replication before canceling",
+						   "waiting %d milliseconds for synchronous replication before canceling",
+							SyncRepNapTimeBeforeCancel,
+							SyncRepNapTimeBeforeCancel)));
+
+	wait_time = SyncRepNapTimeBeforeCancel;
+
+	while (wait_time > 0)
+	{
+		/*
+		 * Wait in intervals of 1 millisecond so that we can frequently check
+		 * for the acknowledgement.
+		 */
+		pg_usleep(1 * 1000L);
+
+		wait_time--;
+
+		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+			return true;
+	}
+
+	return true;
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -264,6 +302,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (ProcDiePending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			ereport(WARNING,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
@@ -281,6 +325,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (QueryCancelPending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			QueryCancelPending = false;
 			ereport(WARNING,
 					(errmsg("canceling wait for synchronous replication due to user request"),
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ab3140ff3a..bb14e47c45 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2496,6 +2496,18 @@ struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,			/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+
+	{
+		{"synchronous_replication_naptime_before_cancel", PGC_SIGHUP, REPLICATION_PRIMARY,
+			gettext_noop("Sets the amount of time to wait for synchronous replication before canceling the wait."),
+			gettext_noop("A value of -1 or 0 disables this feature."),
+			GUC_UNIT_MS
+		},
+		&SyncRepNapTimeBeforeCancel,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2ae76e5cfb..7849e3c5b3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -324,6 +324,8 @@
 				# and comma-separated list of application_name
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
+#synchronous_replication_naptime_before_cancel = 0 # amount of time to wait for
+				# synchronous replication before canceling; 0 or -1 disables
 
 # - Standby Servers -
 
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 0f3b3a7955..9b668e9a61 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -80,6 +80,9 @@ extern PGDLLIMPORT char *syncrep_parse_error_msg;
 /* user-settable parameters for synchronous replication */
 extern PGDLLIMPORT char *SyncRepStandbyNames;
 
+/* user-settable nap time for synchronous replication before canceling */
+extern PGDLLIMPORT int SyncRepNapTimeBeforeCancel;
+
 /* called by user backend */
 extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
 
-- 
2.34.1
