On 2021/03/23 11:40, Fujii Masao wrote:
> 
> 
> On 2021/03/23 9:05, Masahiro Ikeda wrote:
>> Yes. I attached the v5 patch based on v3 patch.
>> I renamed SignalHandlerForUnsafeExit() and fixed the following comment.
> 
> Thanks for updating the patch!
> 
> When the startup process exits because of recovery_target_action=shutdown,
> reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to
> the stats collector. Currently the stats collector ignores SIGTERM, but with
> the patch it exits normally. This change of behavior might be problematic.
> 
> That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes.
> But currently the stats collector and checkpointer don't exit even when
> SIGTERM arrives because they ignore SIGTERM. After several processes
> other than the stats collector and checkpointer exit by SIGTERM,
> PostmasterStateMachine() and reaper() make checkpointer exit and then
> the stats collector exit. The shutdown terminates the processes in this order.
> 
> On the other hand, with the patch, the stats collector exits by SIGTERM
> before checkpointer exits. This is not normal order of processes to exit in
> shutdown.
> 
> To address this issue, one idea is to use SIGUSR2 for normal exit of the stats
> collector, instead of SIGTERM. If we do this, TerminateChildren(SIGTERM)
> cannot terminate the stats collector. Thought?
> 
> If we adopt this idea, the detail comment about why SIGUSR2 is used for that
> needs to be added.

Thanks for your comments. I agreed your concern and suggestion.
Additionally, we need to consider system shutdown cycle too as
CheckpointerMain()'s comment said.

```
         * Note: we deliberately ignore SIGTERM, because during a standard Unix
         * system shutdown cycle, init will SIGTERM all processes at once.  We
         * want to wait for the backends to exit, whereupon the postmaster will
         * tell us it's okay to shut down (via SIGUSR2)
```

I changed the signal from SIGTERM to SIGUSR2 and add the comments why SIGUSR2
is used.
(v6-0001-pgstat_avoid_writing_on_sigquit.patch)

Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..e09a7b3cad 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
 }
 
 /*
- * Simple signal handler for exiting quickly as if due to a crash.
+ * Simple signal handler for processes which have not yet touched or do not
+ * touch shared memory to exit quickly.
  *
- * Normally, this would be used for handling SIGQUIT.
+ * If processes already touched shared memory, call
+ * SignalHandlerForCrashExit() because shared memory may be corrupted.
+ */
+void
+SignalHandlerForNonCrashExit(SIGNAL_ARGS)
+{
+	/*
+	 * Since we don't touch shared memory, we can just pull the plug and exit
+	 * without running any atexit handlers.
+	 */
+	_exit(1);
+}
+
+/*
+ * Simple signal handler for processes which have touched shared memory to
+ * exit quickly.
+ *
+ * Normally, this would be used for handling SIGQUIT as if due to a crash.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some processes use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer and the stats collector exit
+ * on SIGUSR2, and the WAL writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7af7c2707..53b84eda56 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,7 +1,21 @@
 /* ----------
  * pgstat.c
  *
- *	All the statistics collector stuff hacked up in one big, ugly file.
+ * All the statistics collector stuff hacked up in one big, ugly file.
+ *
+ * The statistics collector is started by the postmaster as soon as the
+ * startup subprocess finishes.  It remains alive until the postmaster
+ * commands it to terminate.  Normal termination is by SIGUSR2 after the
+ * checkpointer must exit(0), which instructs the statistics collector
+ * to save the final statistics to reuse at next startup and then exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the statistics
+ * collector will exit quickly without saving the final statistics.  It's
+ * ok because the startup process will remove all statistics at next
+ * startup after emergency termination.
+ *
+ * Because the statistics collector doesn't touch shared memory, even if
+ * the statistics collector exits unexpectedly, the postmaster doesn't treat
+ * it as a crash.  The postmaster will just try to restart a new one.
  *
  *	TODO:	- Separate collector, postmaster and backend stuff
  *			  into different files.
@@ -724,6 +738,7 @@ pgstat_reset_remove_files(const char *directory)
 
 		snprintf(fname, sizeof(fname), "%s/%s", directory,
 				 entry->d_name);
+		elog(DEBUG2, "removing stats file \"%s\"", fname);
 		unlink(fname);
 	}
 	FreeDir(dir);
@@ -4821,17 +4836,31 @@ PgstatCollectorMain(int argc, char *argv[])
 
 	/*
 	 * Ignore all signals usually bound to some action in the postmaster,
-	 * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-	 * support latch operations, because we only use a local latch.
+	 * except SIGHUP, SIGQUIT and SIGUSR2.  Note we don't need a SIGUSR1
+	 * handler to support latch operations, because we only use a local latch.
+	 *
+	 * We deliberately ignore SIGTERM and exit in correct order because we
+	 * want to collect the stats sent during the shutdown from all processes.
+	 * SIGTERM will be received during a standard Unix system shutdown cycle
+	 * because init will SIGTERM all processes at once, and the postmaster
+	 * will SIGTERM all processes at once when recovery_target_action=shutdown
+	 * and the startup process exits after reaching the recovery target.  We
+	 * want to wait for the checkpointer, which is the process sends the stats
+	 * finally, to exit, whereupon the postmaster will tell us it's okay to
+	 * shut down (via SIGUSR2)
+	 *
+	 * If SIGQUIT is received, exit quickly without doing any additional work,
+	 * for example writing stats files.  We arrange to do _exit(1) because the
+	 * stats collector doesn't touch shared memory.
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN);
-	pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
+	pqsignal(SIGQUIT, SignalHandlerForNonCrashExit);
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, SIG_IGN);
-	pqsignal(SIGUSR2, SIG_IGN);
+	pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
 	/* Reset some signals that are accepted by postmaster but not here */
 	pqsignal(SIGCHLD, SIG_DFL);
 	PG_SETMASK(&UnBlockSig);
@@ -4852,8 +4881,8 @@ PgstatCollectorMain(int argc, char *argv[])
 	AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
 
 	/*
-	 * Loop to process messages until we get SIGQUIT or detect ungraceful
-	 * death of our parent postmaster.
+	 * Loop to process messages until we get SIGUSR2, SIGQUIT or detect
+	 * ungraceful death of our parent postmaster.
 	 *
 	 * For performance reasons, we don't want to do ResetLatch/WaitLatch after
 	 * every message; instead, do that only after a recv() fails to obtain a
@@ -4871,7 +4900,7 @@ PgstatCollectorMain(int argc, char *argv[])
 		ResetLatch(MyLatch);
 
 		/*
-		 * Quit if we get SIGQUIT from the postmaster.
+		 * Quit if we get SIGUSR2 from the postmaster.
 		 */
 		if (ShutdownRequestPending)
 			break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e9c4c62bb0..d15576bb20 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -401,7 +401,6 @@ static void SIGHUP_handler(SIGNAL_ARGS);
 static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
-static void process_startup_packet_die(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
 static void StartupPacketTimeoutHandler(void);
 static void CleanupBackend(int pid, int exitstatus);
@@ -3085,7 +3084,7 @@ reaper(SIGNAL_ARGS)
 				 * nothing left for it to do.
 				 */
 				if (PgStatPID != 0)
-					signal_child(PgStatPID, SIGQUIT);
+					signal_child(PgStatPID, SIGUSR2);
 			}
 			else
 			{
@@ -4320,8 +4319,14 @@ BackendInitialize(Port *port)
 	 * Exiting with _exit(1) is only possible because we have not yet touched
 	 * shared memory; therefore no outside-the-process state needs to get
 	 * cleaned up.
+	 *
+	 * One might be tempted to try to send a message, or log one, indicating
+	 * why we are disconnecting.  However, that would be quite unsafe in
+	 * itself. Also, it seems undesirable to provide clues about the
+	 * database's state to a client that has not yet completed authentication,
+	 * or even sent us a startup packet.
 	 */
-	pqsignal(SIGTERM, process_startup_packet_die);
+	pqsignal(SIGTERM, SignalHandlerForNonCrashExit);
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	PG_SETMASK(&StartupBlockSig);
@@ -5274,25 +5279,6 @@ sigusr1_handler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/*
- * SIGTERM while processing startup packet.
- *
- * Running proc_exit() from a signal handler would be quite unsafe.
- * However, since we have not yet touched shared memory, we can just
- * pull the plug and exit without running any atexit handlers.
- *
- * One might be tempted to try to send a message, or log one, indicating
- * why we are disconnecting.  However, that would be quite unsafe in itself.
- * Also, it seems undesirable to provide clues about the database's state
- * to a client that has not yet completed authentication, or even sent us
- * a startup packet.
- */
-static void
-process_startup_packet_die(SIGNAL_ARGS)
-{
-	_exit(1);
-}
-
 /*
  * Dummy signal handler
  *
@@ -5309,7 +5295,7 @@ dummy_handler(SIGNAL_ARGS)
 
 /*
  * Timeout while processing startup packet.
- * As for process_startup_packet_die(), we exit via _exit(1).
+ * As for SignalHandlerForNonCrashExit(), we exit via _exit(1).
  */
 static void
 StartupPacketTimeoutHandler(void)
diff --git a/src/include/postmaster/interrupt.h b/src/include/postmaster/interrupt.h
index 85a1293ef1..3f3dc19e24 100644
--- a/src/include/postmaster/interrupt.h
+++ b/src/include/postmaster/interrupt.h
@@ -26,6 +26,7 @@ extern PGDLLIMPORT volatile sig_atomic_t ShutdownRequestPending;
 
 extern void HandleMainLoopInterrupts(void);
 extern void SignalHandlerForConfigReload(SIGNAL_ARGS);
+extern void SignalHandlerForNonCrashExit(SIGNAL_ARGS);
 extern void SignalHandlerForCrashExit(SIGNAL_ARGS);
 extern void SignalHandlerForShutdownRequest(SIGNAL_ARGS);
 

Reply via email to