Hi,

Commit 7389aad6 started using WaitEventSetWait() to wait for incoming
connections.  Before that, we used select(), for which we have our own
implementation for Windows.

While hacking on patches to rip a bunch of unused Win32 socket wrapper
code out, I twigged that the old pgwin32_select() code was careful to
report multiple sockets at once by brute force polling of all of them,
while WaitEventSetWait() doesn't do that: it reports just one event,
because that's what the Windows WaitForMultipleObjects() call does.
I guess this means you can probably fill up the listen queue of server
socket 1 to prevent server socket 2 from ever being serviced, whereas
on Unix we'll accept one connection at a time from each in round-robin
fashion.

I think we could get the same effect as pgwin32_select() more cheaply
by doing the initial WaitForMultipleObjects() call with the caller's
timeout exactly as we do today, and then, while we have space,
repeatedly calling
WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1],
timeout=0) until it reports timeout.  Windows always reports the
signaled event with the lowest index in the array, so the idea is to
poll the remaining part of the array without waiting, to check for any
more.  In the most common case of FEBE socket waits etc there would be
no extra system call (because it uses nevents=1, so there is no space for
more), and in the postmaster's main loop there would commonly be only
one extra system call to determine that no other events have been
signaled.

The attached patch shows the idea.  It's using an ugly goto, but I
guess it should be made decent with a real loop; I just didn't want to
change the indentation level for this POC.

I mention this now because I'm not sure whether to consider this an
'open item' for 16, or merely an enhancement for 17.  I guess the
former, because someone might call that a new denial of service
vector.  On the other hand, if you fill up the listen queue for socket
1 with enough vigour, you're also denying service to socket 1, so I
don't know if it's worth worrying about.  Opinions on that?

I don't have Windows to test any of this.  Although this patch passes
on CI, that means nothing, as I don't expect the new code to be
reached, so I'm hoping to find someone who would be able to set up
such a test on Windows, i.e. putting an elog() into the new code path
and trying to reach it by connecting to two different ports fast
enough to exercise the multiple-event case.

Or maybe latch.c really needs its own test suite.
From 733fd375578d8216fed9a81d4a1a575b1f542ca9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sat, 1 Apr 2023 12:36:12 +1300
Subject: [PATCH] Teach WaitEventSetWait to report multiple events on Windows.

The WaitEventSet implementation on Windows has always reported only one
event at a time, and always the "lowest" in its event array.  Since
commit 7389aad6 started using WaitEventSet to handle incoming socket
connections, this unfairness might potentially upset someone who wants
to handle incoming connections on multiple sockets.  If one of them has a
non-empty listen queue due to incoming connections, the other might
never be serviced.  The previous coding based on select() was fair in
that way.

Fix, by teaching WaitEventSetWait() to poll for extra events.  No change
in behavior in the common case of callers with nevents=1, but for the
postmaster's main loop, we'll drain all the events that can fit in the
output buffer, which is deliberately set large enough to handle the
maximum possible number of sockets.  This brings the Windows behavior in
line with Unix.
---
 src/backend/storage/ipc/latch.c | 34 +++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..cc7b572008 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 */
 	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
 
+loop:
+
 	occurred_events->pos = cur_event->pos;
 	occurred_events->user_data = cur_event->user_data;
 	occurred_events->events = 0;
@@ -2044,6 +2046,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_LATCH_SET;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events == WL_POSTMASTER_DEATH)
@@ -2063,6 +2066,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_POSTMASTER_DEATH;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events & WL_SOCKET_MASK)
@@ -2124,6 +2128,36 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		{
 			occurred_events++;
 			returned_events++;
+			nevents--;
+		}
+	}
+
+	/*
+	 * Do we have space to report more events that might also be signaled later
+	 * in the array than cur_event?  Being able to return multiple socket
+	 * events at a time like the Unix implementations might be important for
+	 * client code that wants to be able to service busy sockets fairly.
+	 */
+	if (nevents > 0)
+	{
+		int			next_pos = cur_event->pos + 1;
+		int			count = set->nevents - next_pos;
+
+		if (count > 0)
+		{
+			/*
+			 * Poll with zero timeout, and ignore errors now because we
+			 * already have events to report.
+			 */
+			rc = WaitForMultipleObjects(count,
+										set->handles + next_pos + 1,
+										false,
+										0);
+			if (rc >= WAIT_OBJECT_0 && rc < WAIT_OBJECT_0 + count)
+			{
+				cur_event = &set->events[next_pos + rc];
+				goto loop;
+			}
 		}
 	}
 
-- 
2.40.0
