An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

Bharath Rupireddy Mon, 25 Apr 2022 07:21:31 -0700

Hi,

With synchronous replication, a transaction (txn) is typically
committed locally first and then streamed to the sync standbys, and
the backend that generated it waits for an ack from the sync standbys.
While waiting for the ack, the query or the txn may get canceled
(QueryCancelPending is true) or the waiting backend may be asked to
exit (ProcDiePending is true). In either case, the wait for the ack is
canceled and the txn is left in an inconsistent state (the client may
assume that the txn was replicated to the sync standbys) - "The
transaction has already committed locally, but might not have been
replicated to the standby.". Upon restart after a crash, or in the
next txn after the old locally committed txn was canceled, the client
can see txns that were never streamed to the sync standbys. Also, if
the client fails over to one of the sync standbys after the crash
(either by choice or because of automatic failover management), the
txns committed only locally on the crashed primary are lost, which
isn't acceptable in a true HA solution.

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before the backends cancel the
wait for sync replication, i.e. delay the processing of
QueryCancelPending and ProcDiePending. The patch introduces a new
timeout GUC, synchronous_replication_naptime_before_cancel. When set,
it lets the backends keep waiting for the ack before canceling the
synchronous replication wait, so that the transaction can become
available on the sync standbys as well. If the ack isn't received
within this time frame, the backend cancels the wait and goes ahead as
it does today. In production HA environments, the GUC can be set to a
reasonable value to avoid missing transactions during failovers.
2) Wait for the sync standbys to catch up upon restart after a crash,
or in the next txn after the old locally committed txn was canceled.
One way to achieve this is to let the backend that's making the first
connection wait for the sync standbys to catch up in
ClientAuthentication, right after successful authentication. However,
I'm not sure this is the best way to do it at this point.
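
To make (1) concrete, here is a minimal standalone sketch of the
"nap before cancel" idea, outside of PostgreSQL. Everything in it is
a hypothetical stand-in: wait_state plays the role of
MyProc->syncRepState, the probe callback replaces reading shared
memory, and nanosleep() replaces pg_usleep(). Unlike the WIP patch, it
returns true only when the ack was actually observed, which is a
simplification for illustration and not necessarily what the final
patch should do.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for MyProc->syncRepState; not the PostgreSQL type. */
typedef enum { WAIT_PENDING, WAIT_COMPLETE } wait_state;

/* Hypothetical probe that reports whether the standby ack has arrived. */
typedef wait_state (*state_probe) (void *arg);

/* Sleep for roughly one millisecond. */
static void
sleep_one_ms(void)
{
	struct timespec ts = {0, 1000000L};

	nanosleep(&ts, NULL);
}

/*
 * Poll for the ack once per millisecond, for at most naptime_ms
 * milliseconds. Returns true if the ack was seen during the nap, and
 * false if the feature is disabled (naptime_ms == 0) or the nap ran out.
 */
static bool
nap_before_cancel(int naptime_ms, state_probe probe, void *arg)
{
	assert(naptime_ms >= 0);

	for (int remaining = naptime_ms; remaining > 0; remaining--)
	{
		sleep_one_ms();

		if (probe(arg) == WAIT_COMPLETE)
			return true;
	}

	return false;
}

/* Example probe: reports completion once it has been polled *arg times. */
static wait_state
ack_after_n_polls(void *arg)
{
	int		   *polls_left = arg;

	if (*polls_left > 0)
		(*polls_left)--;

	return (*polls_left == 0) ? WAIT_COMPLETE : WAIT_PENDING;
}
```

With this shape, a caller that sees a pending cancel or die request
would call nap_before_cancel() first and leave the wait loop normally
when it returns true, falling back to today's cancel path otherwise.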

Thoughts?

Here's a WIP patch implementing (1); I'm yet to code (2). I haven't
added tests yet: I still need to figure out how to write one, as
there's no way to delay the WAL sender so that we can reliably hit
this code. I will think more about this.

[1]
https://www.postgresql.org/message-id/CAHg%2BQDdTdPsqtu0QLG8rMg3Xo%3D6Xo23TwHPYsUgGNEK13wTY5g%40mail.gmail.com

Regards,
Bharath Rupireddy.

From d5fe07bbd80b72dfbf06e9b039b9e4a93a7f7a06 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sun, 24 Apr 2022 03:42:59 +0000
Subject: [PATCH v1] Wait specified amount of time before cancelling sync
 replication

In a PostgreSQL high availability setup with synchronous replication,
transactions are typically committed locally first and then streamed
to the synchronous standbys, and the backend that generated the
transaction waits for acknowledgement from the synchronous standbys.
While waiting for acknowledgement, the query or the transaction may
get canceled, or the waiting backend may be asked to exit. In either
case, the wait for acknowledgement is canceled, leaving the
transaction in an inconsistent state: committed locally but not on
the standbys. When set, the GUC
synchronous_replication_naptime_before_cancel introduced in this
patch lets the backends keep waiting for the acknowledgement for the
configured time before canceling the wait, so that the transaction
can become available on the synchronous standbys as well.
---
 doc/src/sgml/config.sgml                      | 30 +++++++++++
 src/backend/replication/syncrep.c             | 50 +++++++++++++++++++
 src/backend/utils/misc/guc.c                  | 12 +++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/syncrep.h             |  3 ++
 5 files changed, 97 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 03986946a8..1681ea173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4507,6 +4507,36 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-synchronous-replication-naptime-before-cancel" xreflabel="synchronous_replication_naptime_before_cancel">
+      <term><varname>synchronous_replication_naptime_before_cancel</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>synchronous_replication_naptime_before_cancel</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the amount of time in milliseconds to wait for synchronous
+        replication before cancelling. The default value is 0, which disables
+        this feature. In a <productname>PostgreSQL</productname> high
+        availability setup with synchronous replication, typically all the
+        transactions first locally get committed, then streamed to the
+        synchronous standbys and the backends that generated the transaction
+        will wait for acknowledgement from synchronous standbys. While waiting
+        for acknowledgement, it may happen that the query or the transaction
+        gets canceled or the backend that's waiting for acknowledgement is
+        asked to exit. In either of these cases, the wait for acknowledgement
+        gets canceled and leaves the transaction in an inconsistent state, as it
+        got committed locally but not on the standbys. When set, the
+        <varname>synchronous_replication_naptime_before_cancel</varname>
+        parameter lets the backends wait for the acknowledgement for this long
+        before canceling the wait, so that the transaction can become
+        available on the synchronous standbys as well. This parameter can
+        only be set in the <filename>postgresql.conf</filename> file or on the
+        server command line.
+        </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ce163b99e9..0f54d81f2b 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -88,6 +88,7 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+int			SyncRepNapTimeBeforeCancel = 0;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -119,6 +120,7 @@ static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 static int	SyncRepGetStandbyPriority(void);
 static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
+static bool SyncRepNapBeforeCancel(void);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -130,6 +132,42 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Wait for synchronous replication before cancelling, if requested by user.
+ */
+static bool
+SyncRepNapBeforeCancel(void)
+{
+	int wait_time;
+
+	if (SyncRepNapTimeBeforeCancel <= 0)
+		return false;
+
+	ereport(WARNING,
+			(errmsg_plural("waiting %d millisecond for synchronous replication before cancelling",
+						   "waiting %d milliseconds for synchronous replication before cancelling",
+							SyncRepNapTimeBeforeCancel,
+							SyncRepNapTimeBeforeCancel)));
+
+	wait_time = SyncRepNapTimeBeforeCancel;
+
+	while (wait_time > 0)
+	{
+		/*
+		 * Wait in intervals of 1 millisecond so that we can frequently check
+		 * for the acknowledgement.
+		 */
+		pg_usleep(1 * 1000L);
+
+		wait_time--;
+
+		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+			return true;
+	}
+
+	return true;
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -263,6 +301,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (ProcDiePending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			ereport(WARNING,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
@@ -280,6 +324,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (QueryCancelPending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			QueryCancelPending = false;
 			ereport(WARNING,
 					(errmsg("canceling wait for synchronous replication due to user request"),
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8e9b71375c..547bc2727f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2743,6 +2743,18 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,			/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+
+	{
+		{"synchronous_replication_naptime_before_cancel", PGC_SIGHUP, REPLICATION_PRIMARY,
+			gettext_noop("Sets the amount of time to wait for synchronous replication before cancelling."),
+			gettext_noop("A value of 0 disables this feature."),
+			GUC_UNIT_MS
+		},
+		&SyncRepNapTimeBeforeCancel,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 94270eb0ec..4fd4d04804 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -324,6 +324,8 @@
 				# and comma-separated list of application_name
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
+#synchronous_replication_naptime_before_cancel = 0 # amount of time to wait for
+				# synchronous replication before cancelling; 0 disables
 
 # - Standby Servers -
 
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 4d7c90b9f0..6678f14b93 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -81,6 +81,9 @@ extern PGDLLIMPORT char *syncrep_parse_error_msg;
 /* user-settable parameters for synchronous replication */
 extern PGDLLIMPORT char *SyncRepStandbyNames;
 
+/* user-settable nap time for synchronous replication before cancelling */
+extern PGDLLIMPORT int SyncRepNapTimeBeforeCancel;
+
 /* called by user backend */
 extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
 
-- 
2.25.1
