On 2021/03/23 11:40, Fujii Masao wrote: > > > On 2021/03/23 9:05, Masahiro Ikeda wrote: >> Yes. I attached the v5 patch based on v3 patch. >> I renamed SignalHandlerForUnsafeExit() and fixed the following comment. > > Thanks for updating the patch! > > When the startup process exits because of recovery_target_action=shutdown, > reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to > the stats collector. Currently the stats collector ignores SIGTERM, but with > the patch it exits normally. This change of behavior might be problematic. > > That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes. > But currently the stats collector and checkpointer don't exit even when > SIGTERM arrives because they ignore SIGTERM. After several processes > other than the stats collector and checkpointer exit by SIGTERM, > PostmasterStateMachine() and reaper() make checkpointer exit and then > the stats collector exit. The shutdown terminates the processes in this order. > > On the other hand, with the patch, the stats collector exits by SIGTERM > before checkpointer exits. This is not normal order of processes to exit in > shutdown. > > To address this issue, one idea is to use SIGUSR2 for normal exit of the stats > collector, instead of SIGTERM. If we do this, TerminateChildren(SIGTERM) > cannot terminate the stats collector. Thought? > > If we adopt this idea, the detail comment about why SIGUSR2 is used for that > needs to be added.
Thanks for your comments. I agreed your concern and suggestion. Additionally, we need to consider system shutdown cycle too as CheckpointerMain()'s comment said. ``` * Note: we deliberately ignore SIGTERM, because during a standard Unix * system shutdown cycle, init will SIGTERM all processes at once. We * want to wait for the backends to exit, whereupon the postmaster will * tell us it's okay to shut down (via SIGUSR2) ``` I changed the signal from SIGTERM to SIGUSR2 and add the comments why SIGUSR2 is used. (v6-0001-pgstat_avoid_writing_on_sigquit.patch) Regards, -- Masahiro Ikeda NTT DATA CORPORATION
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..e09a7b3cad 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes which have not yet touched or do not + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForNonCrashExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes which have touched shared memory to + * exit quickly. + * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals. In particular, the checkpointer exits on SIGUSR2, the - * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT - * or SIGTERM. + * other signals. In particular, the checkpointer and the stats collector exit + * on SIGUSR2, and the WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b7af7c2707..53b84eda56 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1,7 +1,21 @@ /* ---------- * pgstat.c * - * All the statistics collector stuff hacked up in one big, ugly file. + * All the statistics collector stuff hacked up in one big, ugly file. + * + * The statistics collector is started by the postmaster as soon as the + * startup subprocess finishes. It remains alive until the postmaster + * commands it to terminate. Normal termination is by SIGUSR2 after the + * checkpointer must exit(0), which instructs the statistics collector + * to save the final statistics to reuse at next startup and then exit(0). + * Emergency termination is by SIGQUIT; like any backend, the statistics + * collector will exit quickly without saving the final statistics. It's + * ok because the startup process will remove all statistics at next + * startup after emergency termination. + * + * Because the statistics collector doesn't touch shared memory, even if + * the statistics collector exits unexpectedly, the postmaster doesn't treat + * it as a crash. The postmaster will just try to restart a new one. * * TODO: - Separate collector, postmaster and backend stuff * into different files. @@ -724,6 +738,7 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); + elog(DEBUG2, "removing stats file \"%s\"", fname); unlink(fname); } FreeDir(dir); @@ -4821,17 +4836,31 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGQUIT and SIGUSR2. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. + * + * We deliberately ignore SIGTERM and exit in correct order because we + * want to collect the stats sent during the shutdown from all processes. + * SIGTERM will be received during a standard Unix system shutdown cycle + * because init will SIGTERM all processes at once, and the postmaster + * will SIGTERM all processes at once when recovery_target_action=shutdown + * and the startup process exits after reaching the recovery target. We + * want to wait for the checkpointer, which is the process sends the stats + * finally, to exit, whereupon the postmaster will tell us it's okay to + * shut down (via SIGUSR2) + * + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGQUIT, SignalHandlerForNonCrashExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); - pqsignal(SIGUSR2, SIG_IGN); + pqsignal(SIGUSR2, SignalHandlerForShutdownRequest); /* Reset some signals that are accepted by postmaster but not here */ pqsignal(SIGCHLD, SIG_DFL); PG_SETMASK(&UnBlockSig); @@ -4852,8 +4881,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL); /* - * Loop to process messages until we get SIGQUIT or detect ungraceful - * death of our parent postmaster. + * Loop to process messages until we get SIGUSR2, SIGQUIT or detect + * ungraceful death of our parent postmaster. * * For performance reasons, we don't want to do ResetLatch/WaitLatch after * every message; instead, do that only after a recv() fails to obtain a @@ -4871,7 +4900,7 @@ PgstatCollectorMain(int argc, char *argv[]) ResetLatch(MyLatch); /* - * Quit if we get SIGQUIT from the postmaster. + * Quit if we get SIGUSR2 from the postmaster. */ if (ShutdownRequestPending) break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index e9c4c62bb0..d15576bb20 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -401,7 +401,6 @@ static void SIGHUP_handler(SIGNAL_ARGS); static void pmdie(SIGNAL_ARGS); static void reaper(SIGNAL_ARGS); static void sigusr1_handler(SIGNAL_ARGS); -static void process_startup_packet_die(SIGNAL_ARGS); static void dummy_handler(SIGNAL_ARGS); static void StartupPacketTimeoutHandler(void); static void CleanupBackend(int pid, int exitstatus); @@ -3085,7 +3084,7 @@ reaper(SIGNAL_ARGS) * nothing left for it to do. */ if (PgStatPID != 0) - signal_child(PgStatPID, SIGQUIT); + signal_child(PgStatPID, SIGUSR2); } else { @@ -4320,8 +4319,14 @@ BackendInitialize(Port *port) * Exiting with _exit(1) is only possible because we have not yet touched * shared memory; therefore no outside-the-process state needs to get * cleaned up. + * + * One might be tempted to try to send a message, or log one, indicating + * why we are disconnecting. However, that would be quite unsafe in + * itself. Also, it seems undesirable to provide clues about the + * database's state to a client that has not yet completed authentication, + * or even sent us a startup packet. */ - pqsignal(SIGTERM, process_startup_packet_die); + pqsignal(SIGTERM, SignalHandlerForNonCrashExit); /* SIGQUIT handler was already set up by InitPostmasterChild */ InitializeTimeouts(); /* establishes SIGALRM handler */ PG_SETMASK(&StartupBlockSig); @@ -5274,25 +5279,6 @@ sigusr1_handler(SIGNAL_ARGS) errno = save_errno; } -/* - * SIGTERM while processing startup packet. - * - * Running proc_exit() from a signal handler would be quite unsafe. - * However, since we have not yet touched shared memory, we can just - * pull the plug and exit without running any atexit handlers. - * - * One might be tempted to try to send a message, or log one, indicating - * why we are disconnecting. However, that would be quite unsafe in itself. - * Also, it seems undesirable to provide clues about the database's state - * to a client that has not yet completed authentication, or even sent us - * a startup packet. - */ -static void -process_startup_packet_die(SIGNAL_ARGS) -{ - _exit(1); -} - /* * Dummy signal handler * @@ -5309,7 +5295,7 @@ dummy_handler(SIGNAL_ARGS) /* * Timeout while processing startup packet. - * As for process_startup_packet_die(), we exit via _exit(1). + * As for SignalHandlerForNonCrashExit(), we exit via _exit(1). */ static void StartupPacketTimeoutHandler(void) diff --git a/src/include/postmaster/interrupt.h b/src/include/postmaster/interrupt.h index 85a1293ef1..3f3dc19e24 100644 --- a/src/include/postmaster/interrupt.h +++ b/src/include/postmaster/interrupt.h @@ -26,6 +26,7 @@ extern PGDLLIMPORT volatile sig_atomic_t ShutdownRequestPending; extern void HandleMainLoopInterrupts(void); extern void SignalHandlerForConfigReload(SIGNAL_ARGS); +extern void SignalHandlerForNonCrashExit(SIGNAL_ARGS); extern void SignalHandlerForCrashExit(SIGNAL_ARGS); extern void SignalHandlerForShutdownRequest(SIGNAL_ARGS);