[PATCH] fix a performance issue with multiple logical-decoding walsenders

Pierre Ducroquet Thu, 26 Dec 2019 08:44:01 -0800

Hello

Our current setup uses logical replication to build a BI replication server 
along our primary clusters (running PG 10.10 so far). This implies having one 
logical replication slot per database. After some analysis, we identified two 
hot spots behind this issue. Fixing them gave us a 10 fold performance 
improvement in decoding speed.



We noticed our load had gotten quite bigger on our primary since the 
introduction of this replication, seeing spikes in system time when a lot of 
wal were being written (for instance when creating GIN indexes).

The first hot spot is PostmasterIsAlive. The implementation reads on a pipe to 
know if the postmaster is still alive, but this is very expensive kernel-wise. 
When switching the implementation to a much more primitive (and probably 
wrong):
        bool PostmasterIsAliveInternal() {
                return getppid() == PostmasterPid;
        }
we stopped seeing spikes in system time.
But after doing that, the CPU time used by our walsenders increased. We 
reached a second hot spot, this time in XLogSendLogical, where each walsender 
was using 100% of user CPU for minutes. After checking with perf, it appears 
all the decoders are fighting on GetFlushRecPtr.

On PostgreSQL 12, the call to PostmasterIsAlive is no longer present in 
WalSndLoop (thanks to commit cfdf4dc4f), so only the second hot spot is still 
present, with the same effects.

Attached to this email are two patches. 

The first one, specific to PG 10 for our use-case, simplifies the 
PostmasterIsAlive function, as described above. I don't know if this 
implementation is valid, but it was needed to uncover the performance issue in 
XLogSendLogical. Would it be possible to remove the systematic call to 
PostmasterIsAlive in WalSndLoop? We are not certain of the behaviour.

The second one was tested on PG 10 and PG 12 (with 48 lines offset). It has on 
PG12 the same effect it has on a PG10+isAlive patch. Instead of calling each 
time GetFlushRecPtr, we call it only if we notice we have reached the value of 
the previous call. This way, when the senders are busy decoding, we are no 
longer fighting for a spinlock to read the FlushRecPtr.

Here are some benchmark results.
On PG 10, to decode our replication stream, we went from 3m 43s to over 5 
minutes after removing the first hot spot, and then down to 22 seconds.
On PG 12, we had to change the benchmark (due to GIN indexes creation being 
more optimized) so we can not compare directly with our previous bench. We 
went from 15m 11s down to 59 seconds.
If needed, we can provide scripts to reproduce this situation. It is quite 
simple: add ~20 walsenders doing logical replication in database A, and then 
generate a lot of data in database B. The walsenders will be woken up by the 
activity on database B, but not sending it thus keeping hitting the same 
locks.

Regards

diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 85db6b2..7a76687 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -272,6 +272,13 @@ bool
 PostmasterIsAlive(void)
 {
 #ifndef WIN32
+
+	if (getppid() != PostmasterPid)
+		return false;
+	else
+		return true;
+
+/*
 	char		c;
 	ssize_t		rc;
 
@@ -287,6 +294,7 @@ PostmasterIsAlive(void)
 		elog(FATAL, "unexpected data in postmaster death monitoring pipe");
 
 	return false;
+*/
 #else							/* WIN32 */
 	return (WaitForSingleObject(PostmasterHandle, 0) == WAIT_TIMEOUT);
 #endif							/* WIN32 */

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 74330f8..8dd327e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2765,7 +2765,7 @@ XLogSendLogical(void)
 {
 	XLogRecord *record;
 	char	   *errm;
-	XLogRecPtr	flushPtr;
+	static XLogRecPtr flushPtr = 0;
 
 	/*
 	 * Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2782,11 +2782,6 @@ XLogSendLogical(void)
 	if (errm != NULL)
 		elog(ERROR, "%s", errm);
 
-	/*
-	 * We'll use the current flush point to determine whether we've caught up.
-	 */
-	flushPtr = GetFlushRecPtr();
-
 	if (record != NULL)
 	{
 		/*
@@ -2801,7 +2796,14 @@ XLogSendLogical(void)
 
 	/* Set flag if we're caught up. */
 	if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
-		WalSndCaughtUp = true;
+	{
+		/*
+		 * We'll use the current flush point to determine whether we've caught up.
+		 */
+		flushPtr = GetFlushRecPtr();
+		if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+			WalSndCaughtUp = true;
+	}
 
 	/*
 	 * If we're caught up and have been requested to stop, have WalSndLoop()

[PATCH] fix a performance issue with multiple logical-decoding walsenders

Reply via email to