On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Hmm, so for speedy response to postmaster death, you're relying on the
> loops to have other postmaster-death checks besides
> HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That
> seems a bit fragile, at the very least it needs a comment in
> HandleStartupProcInterrupts() to call it out.

Surely that's the direction we want to go in, though, no?  Commit
cfdf4dc4 was intended to standardise the way we react to postmaster
death where waiting is involved.  I updated the comment in
HandleStartupProcInterrupts() to highlight that the
PostmasterIsAlive() check in there is only for the benefit of
CPU-bound loops.

> Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep().

Updating that one required me to invent a new wait_event for
pg_stat_activity, which seems like progress.

Unfortunately, while I was doing that I realised that WaitLatch()
without WL_SET_LATCH was broken by commit 3347c982bab (in master
only), in a way not previously reached by the tests.  So 0001 is a
patch to fix that.
From 2afa37048fe659bb1f1b3c8aec89824b2f00a892 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 18 Sep 2020 00:04:56 +1200
Subject: [PATCH v2 1/3] Allow WaitLatch() to be used without a latch.

Due to flaws in commit 3347c982bab, using WaitLatch() without
WL_SET_LATCH could cause an assertion failure or crash.  Repair.

While here, also add a check that the latch we're switching to belongs
to this backend.

Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4153cc8557..63c6c97536 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -924,7 +924,22 @@ ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
 
 	if (events == WL_LATCH_SET)
 	{
+		if (latch && latch->owner_pid != MyProcPid)
+			elog(ERROR, "cannot wait on a latch owned by another process");
 		set->latch = latch;
+		/*
+		 * On Unix, we don't need to modify the kernel object because the
+		 * underlying pipe is the same for all latches so we can return
+		 * immediately.  On Windows, we need to update our array of handles,
+		 * but we leave the old one in place and tolerate spurious wakeups if
+		 * the latch is disabled.
+		 */
+#if defined(WAIT_USE_WIN32)
+		if (!latch)
+			return;
+#else
+		return;
+#endif
 	}
 
 #if defined(WAIT_USE_EPOLL)
@@ -1386,7 +1401,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drainSelfPipe();
 
-			if (set->latch->is_set)
+			if (set->latch && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1536,7 +1551,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drainSelfPipe();
 
-			if (set->latch->is_set)
+			if (set->latch && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1645,7 +1660,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drainSelfPipe();
 
-			if (set->latch->is_set)
+			if (set->latch && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1812,7 +1827,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		if (!ResetEvent(set->latch->event))
 			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
 
-		if (set->latch->is_set)
+		if (set->latch && set->latch->is_set)
 		{
 			occurred_events->fd = PGINVALID_SOCKET;
 			occurred_events->events = WL_LATCH_SET;
-- 
2.20.1

From 5d62d3091f0980299f1471f0a8006c0dd0bea648 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 17 Sep 2020 21:11:01 +1200
Subject: [PATCH v2 2/3] Poll postmaster less frequently in recovery.

Since commits 9f095299 and f98b8476 we don't poll the postmaster
pipe at all during crash recovery on Linux and FreeBSD, but on other
operating systems we were still doing it for every WAL record.  Do it
less frequently on operating systems where system calls are required, at
the cost of delaying exit a bit after postmaster death, to avoid
expensive system calls reported to slow down CPU-bound recovery by
10-30%.

Replace a couple of pg_usleep()-based wait loops with WaitLatch() loops,
to make sure they respond promptly to postmaster death now that
HandleStartupProcInterrupts() doesn't check every time through on some
systems.

Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083...@iki.fi
Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
---
 doc/src/sgml/monitoring.sgml               |  4 ++++
 src/backend/access/transam/xlog.c          |  5 ++---
 src/backend/postmaster/pgstat.c            |  3 +++
 src/backend/postmaster/startup.c           | 16 ++++++++++++++--
 src/backend/replication/walreceiverfuncs.c |  6 ++++--
 src/include/pgstat.h                       |  1 +
 6 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 673a0e73e4..b0a8fc1582 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1734,6 +1734,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for confirmation from a remote server during synchronous
        replication.</entry>
      </row>
+     <row>
+      <entry><literal>WalrcvExit</literal></entry>
+      <entry>Waiting for the walreceiver to exit.</entry>
+     </row>
      <row>
       <entry><literal>XactGroupUpdate</literal></entry>
       <entry>Waiting for the group leader to update transaction status at
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a38371a64f..578006fa32 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6027,9 +6027,8 @@ recoveryPausesHere(bool endOfRecovery)
 		HandleStartupProcInterrupts();
 		if (CheckForStandbyTrigger())
 			return;
-		pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
-		pg_usleep(1000000L);	/* 1000 ms */
-		pgstat_report_wait_end();
+		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000,
+				  WAIT_EVENT_RECOVERY_PAUSE);
 	}
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..395d61f082 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3875,6 +3875,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_WALRCV_EXIT:
+			event_name = "WalrcvExit";
+			break;
 		case WAIT_EVENT_XACT_GROUP_UPDATE:
 			event_name = "XactGroupUpdate";
 			break;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index fd9ac35dac..330b361a3e 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -134,6 +134,10 @@ StartupRereadConfig(void)
 void
 HandleStartupProcInterrupts(void)
 {
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+	static int	count = 0;
+#endif
+
 	/*
 	 * Process any requests or signals received recently.
 	 */
@@ -151,9 +155,17 @@ HandleStartupProcInterrupts(void)
 
 	/*
 	 * Emergency bailout if postmaster has died.  This is to avoid the
-	 * necessity for manual cleanup of all postmaster children.
+	 * necessity for manual cleanup of all postmaster children.  Do this less
+	 * frequently on systems for which we don't have signals to make that
+	 * cheap.  Any loop that sleeps should be using WaitLatch or similar and
+	 * handling postmaster death that way; the check here is intended only to
+	 * deal with CPU-bound loops such as the main redo loop.
 	 */
-	if (IsUnderPostmaster && !PostmasterIsAlive())
+	if (IsUnderPostmaster &&
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+		count++ % 1024 == 0 &&
+#endif
+		!PostmasterIsAlive())
 		exit(1);
 
 	/* Process barrier events */
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e675757301..d0263ba16e 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -23,8 +23,10 @@
 #include <signal.h>
 
 #include "access/xlog_internal.h"
+#include "pgstat.h"
 #include "postmaster/startup.h"
 #include "replication/walreceiver.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 #include "utils/timestamp.h"
@@ -207,8 +209,8 @@ ShutdownWalRcv(void)
 		 * process.
 		 */
 		HandleStartupProcInterrupts();
-
-		pg_usleep(100000);		/* 100ms */
+		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 100,
+				  WAIT_EVENT_WALRCV_EXIT);
 	}
 }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..f934acaed9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,6 +887,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_WALRCV_EXIT,
 	WAIT_EVENT_XACT_GROUP_UPDATE
 } WaitEventIPC;
 
-- 
2.20.1

Reply via email to