On Sun, Jan 15, 2023 at 12:35 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > Here's a sketch of the first idea.
To hit the problem case, the signal needs to arrive in between the latch->is_set check and the epoll_wait() call, and the handler needs to take a while to get started. (If it arrives before the latch->is_set check we report WL_LATCH_SET immediately, and if it arrives after the epoll_wait() call begins, we get EINTR and go back around to the latch->is_set check.) With some carefully placed sleeps to simulate a CPU-starved system (see attached) I managed to get a kill-then-connect sequence to produce: 2023-01-17 10:48:32.508 NZDT [555849] LOG: nevents = 2 2023-01-17 10:48:32.508 NZDT [555849] LOG: events[0] = WL_SOCKET_ACCEPT 2023-01-17 10:48:32.508 NZDT [555849] LOG: events[1] = WL_LATCH_SET 2023-01-17 10:48:32.508 NZDT [555849] LOG: received SIGHUP, reloading configuration files With the patch I posted, we process that in the order we want: 2023-01-17 11:06:31.340 NZDT [562262] LOG: nevents = 2 2023-01-17 11:06:31.340 NZDT [562262] LOG: events[1] = WL_LATCH_SET 2023-01-17 11:06:31.340 NZDT [562262] LOG: received SIGHUP, reloading configuration files 2023-01-17 11:06:31.344 NZDT [562262] LOG: events[0] = WL_SOCKET_ACCEPT Other thoughts: Another idea would be to teach the latch infrastructure itself to magically swap latch events to position 0. Latches are usually prioritised; it's only in this rare race case that they are not. Or going the other way, I realise that we're lacking a "wait for reload" mechanism as discussed in other threads (usually people want this if they care about its effects on backends other than the postmaster, where all bets are off and Andres once suggested the ProcSignalBarrier hammer), and if we ever got something like that it might be another solution to this particular problem.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 9cedc1b9f0..f5a310f844 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1732,6 +1732,8 @@ ServerLoop(void) lengthof(events), 0 /* postmaster posts no wait_events */ ); + elog(LOG, "nevents = %d", nevents); + /* * Latch set by signal handler, or new connection pending on any of * our sockets? If the latter, fork a child process to deal with it. @@ -1740,6 +1742,7 @@ ServerLoop(void) { if (events[i].events & WL_LATCH_SET) { + elog(LOG, "events[%i] = WL_LATCH_SET", i); ResetLatch(MyLatch); /* Process work requested via signal handlers. */ @@ -1756,6 +1759,7 @@ ServerLoop(void) { Port *port; + elog(LOG, "events[%i] = WL_SOCKET_ACCEPT", i); port = ConnCreate(events[i].fd); if (port) { @@ -2679,6 +2683,8 @@ handle_pm_reload_request_signal(SIGNAL_ARGS) { int save_errno = errno; + pg_usleep(10000); + pending_pm_reload_request = true; SetLatch(MyLatch); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index d79d71a851..99b0aade1c 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -1465,6 +1465,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout, break; } + if (set->latch) + pg_usleep(1000000); + /* * Wait for events using the readiness primitive chosen at the top of * this file. If -1 is returned, a timeout has occurred, if 0 we have