Michael Paquier <mich...@paquier.xyz> writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
> In both cases, the instance cannot shut down because it times out,
> waiting for the shutdown checkpoint to finish, but I suspect that this
> checkpoint actually never happens.

Hmm, I don't think that that is actually where the problem is.

In piculet's failure, the test script times out waiting for a "fast"
shutdown of the standby server, and what we see in the standby's log is

2019-03-23 22:44:12.181 UTC [9731] LOG: received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG: aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL: terminating walreceiver process due to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG: received immediate shutdown request

where the last line indicates that the test script lost patience and
issued an immediate shutdown.  However, in a successful run of the test,
the log looks like

2019-03-24 03:33:25.592 UTC [23816] LOG: received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG: aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL: terminating walreceiver process due to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG: shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG: database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG: starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit

where the last line reflects restarting the server for the next test step.

So in the failure case we don't see the "shutting down" message, which
means we never got to ShutdownXLOG, so no checkpoint request was made.
Even if we had got to ShutdownXLOG, that process just executes the
operation directly; it's not sending a signal asking some other process
to do the checkpoint.  So it's hard to see how either of the commits you
mention could be involved.

I think what we need to look for is reasons why (1) the postmaster never
sends SIGUSR2 to the checkpointer, or (2) the checkpointer's main loop
doesn't get to noticing shutdown_requested.

A rather scary point for (2) is that said main loop seems to be assuming
that MyLatch a/k/a MyProc->procLatch is not used for any other purposes
in the checkpointer process.  If there were something, like say a
condition variable wait, that would reset MyLatch at any time during a
checkpoint, then we could very easily go to sleep at the bottom of the
loop and not notice that there's a pending shutdown request.

Now, c6c9474aa did not break this, because the latch resets that it
added happen in other processes, not the checkpointer.  But I'm feeling
suspicious that some other change we made recently might've borked it.

And in general, it seems like we've managed to load a lot of potentially
conflicting roles onto process latches.  Do we need to think harder
about establishing rules for multiplexed use of the process latch?
I'm imagining some rule like "if you are not the outermost event loop of
a process, you do not get to summarily clear MyLatch.  Make sure to
leave it set after waiting, if there was any possibility that it was set
by something other than the specific event you're concerned with".

			regards, tom lane
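
PS: to make the hazard concrete, here's a bare-bones sketch (emphatically
not the real checkpointer.c; DoCheckpointWork, condition_is_satisfied,
InnerWaitForSomething and the timeout arithmetic are just placeholders,
only the SetLatch/ResetLatch/WaitLatch calls are meant literally) of the
pattern I'm worried about, plus what the proposed rule would ask of an
inner wait:

/*
 * Sketch only: a checkpointer-like main loop that assumes nothing else
 * touches MyLatch.
 */
static volatile sig_atomic_t shutdown_requested = false;

/* SIGUSR2 from the postmaster: please shut down */
static void
ReqShutdownHandler(SIGNAL_ARGS)
{
	shutdown_requested = true;
	SetLatch(MyLatch);			/* wake the main loop ... we hope */
}

static void
CheckpointerMainLoop(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);

		if (shutdown_requested)
			break;				/* go run the shutdown sequence */

		/*
		 * If anything in here resets MyLatch (say, a condition variable
		 * wait), then a SIGUSR2 arriving after the check above has its
		 * SetLatch eaten, and the WaitLatch below sleeps for the whole
		 * timeout without ever noticing shutdown_requested.
		 */
		DoCheckpointWork();

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 checkpoint_timeout * 1000L,
						 WAIT_EVENT_CHECKPOINTER_MAIN);
	}
}

/*
 * What the proposed rule would mean for an inner wait: it may consume
 * latch wakeups while it waits, but it must not leave the latch clear,
 * because it can't tell whether some of those wakeups were meant for
 * the outer loop.
 */
static void
InnerWaitForSomething(void)
{
	while (!condition_is_satisfied())
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, 0);
		ResetLatch(MyLatch);
	}
	SetLatch(MyLatch);			/* hand any possibly-stolen wakeup back */
}

The key point being that the inner wait has no way to know whether a
SetLatch it consumed came from ReqShutdownHandler or from whatever it
was actually waiting for, so the only safe choice is to give the wakeup
back to the outer loop.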