On Thu, Sep 24, 2020 at 2:39 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> Does this patch work fine with warm-standby case using pg_standby?
> IIUC the startup process doesn't call WaitLatch() in that case, so ISTM that,
> with the patch, it cannot detect the postmaster death immediately.

Right, RestoreArchivedFile() uses system(), so I guess it can hang
around for a long time after unexpected postmaster exit on every OS if
the command waits.  To respond to various kinds of important
interrupts, I suppose that'd ideally use something like
OpenPipeStream() and a typical WaitLatch() loop with CFI().  I'm not
sure what our policy is or should be for exiting while we have running
subprocesses.  I guess that is a separate issue.

Here's a rebase, no code change.
From 2cd588ecc72729930a06b20da65537dfcb8b2f52 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Sep 2020 17:37:54 +1200
Subject: [PATCH v4] Poll postmaster less frequently in recovery.

Since commits 9f095299 and f98b8476 we don't poll the postmaster
pipe at all during crash recovery on Linux and FreeBSD, but on other
operating systems we were still doing it for every WAL record.  Do it
less frequently on operating systems where system calls are required, at
the cost of delaying exit a bit after postmaster death, to avoid
expensive system calls reported to slow down CPU-bound recovery by
as much as 10-30%.

Replace a couple of pg_usleep()-based wait loops with WaitLatch() loops,
to make sure they respond promptly to postmaster death now that
HandleStartupProcInterrupts() doesn't check every time through on some
systems.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Reviewed-by: Fujii Masao <masao.fu...@oss.nttdata.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083...@iki.fi
---
 doc/src/sgml/monitoring.sgml               |  4 ++++
 src/backend/access/transam/xlog.c          |  7 +++----
 src/backend/postmaster/pgstat.c            |  3 +++
 src/backend/postmaster/startup.c           | 16 ++++++++++++++--
 src/backend/replication/walreceiverfuncs.c |  6 ++++--
 src/include/pgstat.h                       |  1 +
 6 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..65210ee064 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1734,6 +1734,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for confirmation from a remote server during synchronous
        replication.</entry>
      </row>
+     <row>
+      <entry><literal>WalrcvExit</literal></entry>
+      <entry>Waiting for the walreceiver to exit.</entry>
+     </row>
      <row>
       <entry><literal>XactGroupUpdate</literal></entry>
       <entry>Waiting for the group leader to update transaction status at
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..f9d9b38a8a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5999,7 +5999,7 @@ recoveryStopsAfter(XLogReaderState *record)
  * the paused state starts at the end of recovery because of
  * recovery_target_action=pause, and false otherwise.
  *
- * XXX Could also be done with shared latch, avoiding the pg_usleep loop.
+ * XXX Could also be done with shared latch, avoiding the WL_TIMEOUT loop.
  * Probably not worth the trouble though.  This state shouldn't be one that
  * anyone cares about server power consumption in.
  */
@@ -6028,9 +6028,8 @@ recoveryPausesHere(bool endOfRecovery)
 		HandleStartupProcInterrupts();
 		if (CheckForStandbyTrigger())
 			return;
-		pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
-		pg_usleep(1000000L);	/* 1000 ms */
-		pgstat_report_wait_end();
+		(void) WaitLatch(NULL, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 1000,
+						 WAIT_EVENT_RECOVERY_PAUSE);
 	}
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..395d61f082 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3875,6 +3875,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_WALRCV_EXIT:
+			event_name = "WalrcvExit";
+			break;
 		case WAIT_EVENT_XACT_GROUP_UPDATE:
 			event_name = "XactGroupUpdate";
 			break;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 64af7b8707..a802041ca7 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -134,6 +134,10 @@ StartupRereadConfig(void)
 void
 HandleStartupProcInterrupts(void)
 {
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+	static int	count = 0;
+#endif
+
 	/*
 	 * Process any requests or signals received recently.
 	 */
@@ -151,9 +155,17 @@ HandleStartupProcInterrupts(void)
 
 	/*
 	 * Emergency bailout if postmaster has died.  This is to avoid the
-	 * necessity for manual cleanup of all postmaster children.
+	 * necessity for manual cleanup of all postmaster children.  Do this less
+	 * frequently on systems for which we don't have signals to make that
+	 * cheap.  Any loop that sleeps should be using WaitLatch or similar and
+	 * handling postmaster death that way; the check here is intended only to
+	 * deal with CPU-bound loops such as the main redo loop.
 	 */
-	if (IsUnderPostmaster && !PostmasterIsAlive())
+	if (IsUnderPostmaster &&
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+		count++ % 1024 == 0 &&
+#endif
+		!PostmasterIsAlive())
 		exit(1);
 
 	/* Process barrier events */
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e675757301..059353f342 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -23,8 +23,10 @@
 #include <signal.h>
 
 #include "access/xlog_internal.h"
+#include "pgstat.h"
 #include "postmaster/startup.h"
 #include "replication/walreceiver.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 #include "utils/timestamp.h"
@@ -207,8 +209,8 @@ ShutdownWalRcv(void)
 		 * process.
 		 */
 		HandleStartupProcInterrupts();
-
-		pg_usleep(100000);		/* 100ms */
+		(void) WaitLatch(NULL, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 100,
+						 WAIT_EVENT_WALRCV_EXIT);
 	}
 }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..f934acaed9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,6 +887,7 @@ typedef enum
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_WALRCV_EXIT,
 	WAIT_EVENT_XACT_GROUP_UPDATE
 } WaitEventIPC;
 
-- 
2.20.1

Reply via email to