On Sun, Jan 15, 2023 at 12:35 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
> Here's a sketch of the first idea.

To hit the problem case, the signal needs to arrive in between the
latch->is_set check and the epoll_wait() call, and the handler needs
to take a while to get started.  (If it arrives before the
latch->is_set check we report WL_LATCH_SET immediately, and if it
arrives after the epoll_wait() call begins, we get EINTR and go back
around to the latch->is_set check.)  With some carefully placed sleeps
to simulate a CPU-starved system (see attached) I managed to get a
kill-then-connect sequence to produce:

2023-01-17 10:48:32.508 NZDT [555849] LOG:  nevents = 2
2023-01-17 10:48:32.508 NZDT [555849] LOG:  events[0] = WL_SOCKET_ACCEPT
2023-01-17 10:48:32.508 NZDT [555849] LOG:  events[1] = WL_LATCH_SET
2023-01-17 10:48:32.508 NZDT [555849] LOG:  received SIGHUP, reloading
configuration files

With the patch I posted, we process that in the order we want:

2023-01-17 11:06:31.340 NZDT [562262] LOG:  nevents = 2
2023-01-17 11:06:31.340 NZDT [562262] LOG:  events[1] = WL_LATCH_SET
2023-01-17 11:06:31.340 NZDT [562262] LOG:  received SIGHUP, reloading
configuration files
2023-01-17 11:06:31.344 NZDT [562262] LOG:  events[0] = WL_SOCKET_ACCEPT

Other thoughts:

Another idea would be to teach the latch infrastructure itself to
magically swap latch events to position 0.  Latches are usually
prioritised; it's only in this rare race case that they are not.

Or going the other way, I realise that we're lacking a "wait for
reload" mechanism as discussed in other threads (usually people want
this if they care about its effects on backends other than the
postmaster, where all bets are off and Andres once suggested the
ProcSignalBarrier hammer), and if we ever got something like that it
might be another solution to this particular problem.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..f5a310f844 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1732,6 +1732,8 @@ ServerLoop(void)
 								   lengthof(events),
 								   0 /* postmaster posts no wait_events */ );
 
+		elog(LOG, "nevents = %d", nevents);
+
 		/*
 		 * Latch set by signal handler, or new connection pending on any of
 		 * our sockets? If the latter, fork a child process to deal with it.
@@ -1740,6 +1742,7 @@ ServerLoop(void)
 		{
 			if (events[i].events & WL_LATCH_SET)
 			{
+				elog(LOG, "events[%i] = WL_LATCH_SET", i);
 				ResetLatch(MyLatch);
 
 				/* Process work requested via signal handlers. */
@@ -1756,6 +1759,7 @@ ServerLoop(void)
 			{
 				Port	   *port;
 
+				elog(LOG, "events[%i] = WL_SOCKET_ACCEPT", i);
 				port = ConnCreate(events[i].fd);
 				if (port)
 				{
@@ -2679,6 +2683,8 @@ handle_pm_reload_request_signal(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
+	pg_usleep(10000);
+
 	pending_pm_reload_request = true;
 	SetLatch(MyLatch);
 
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d79d71a851..99b0aade1c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1465,6 +1465,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			break;
 		}
 
+		if (set->latch)
+			pg_usleep(1000000);
+
 		/*
 		 * Wait for events using the readiness primitive chosen at the top of
 		 * this file. If -1 is returned, a timeout has occurred, if 0 we have

Reply via email to