Hi,

Commit 7389aad6 started using WaitEventSetWait() to wait for incoming connections. Before that, we used select(), for which we had our own implementation for Windows.
While hacking on patches to rip a bunch of unused Win32 socket wrapper code out, I twigged that the old pgwin32_select() code was careful to report multiple sockets at once by brute-force polling of all of them, while WaitEventSetWait() doesn't do that: it reports just one event, because that's what the Windows WaitForMultipleObjects() call does. I guess this means you can probably fill up the listen queue of server socket 1 to prevent server socket 2 from ever being serviced, whereas on Unix we'll accept one connection at a time from each in round-robin fashion.

I think we could get the same effect as pgwin32_select() more cheaply by doing the initial WaitForMultipleObjects() call with the caller's timeout exactly as we do today, and then, while we have space, repeatedly calling WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1], timeout=0) until it reports a timeout. Windows always reports the signaled event with the lowest index in the array, so the idea is to poll the remaining part of the array without waiting, to check for any more. In the most common case of FEBE socket waits etc. there would be no extra system call (because those callers use nevents=1, so there is no space for more), and in the postmaster's main loop there would commonly be only one extra system call to determine that no other events have been signaled.

The attached patch shows the idea. It's using an ugly goto, but I guess it should be made decent with a real loop; I just didn't want to change the indentation level for this POC.

I mention this now because I'm not sure whether to consider this an 'open item' for 16, or merely an enhancement for 17. I guess the former, because someone might call it a new denial-of-service vector. On the other hand, if you fill up the listen queue for socket 1 with enough vigour, you're also denying service to socket 1, so I don't know if it's worth worrying about. Opinions on that?

I don't have Windows to test any of this.
Although this patch passes on CI, that means nothing, as I don't expect the new code to be reached, so I'm hoping to find someone who would be able to set up such a test on Windows, i.e. putting an elog() into the new code path and trying to reach it by connecting to two different ports fast enough to exercise the multiple-event case. Or maybe latch.c really needs its own test suite.
From 733fd375578d8216fed9a81d4a1a575b1f542ca9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sat, 1 Apr 2023 12:36:12 +1300
Subject: [PATCH] Teach WaitEventSetWait to report multiple events on Windows.

The WaitEventSet implementation on Windows has always reported only one
event at a time, and always the "lowest" in its event array.  Since
commit 7389aad6 started using WaitEventSet to handle incoming socket
connections, this unfairness might potentially upset someone who wants
to handle incoming connections on multiple sockets.  If one of them has
a non-empty listen queue due to incoming connections, the other might
never be serviced.  The previous coding based on select() was fair in
that way.

Fix by teaching WaitEventSetWait() to poll for extra events.  No change
in behavior in the common case of callers with nevents=1, but for the
postmaster's main loop, we'll drain all the events that can fit in the
output buffer, which is deliberately set large enough to handle the
maximum possible number of sockets.  This brings the Windows behavior
in line with Unix.
---
 src/backend/storage/ipc/latch.c | 34 +++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..cc7b572008 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 */
 	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
 
+loop:
+
 	occurred_events->pos = cur_event->pos;
 	occurred_events->user_data = cur_event->user_data;
 	occurred_events->events = 0;
@@ -2044,6 +2046,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_LATCH_SET;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events == WL_POSTMASTER_DEATH)
@@ -2063,6 +2066,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_POSTMASTER_DEATH;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events & WL_SOCKET_MASK)
@@ -2124,6 +2128,36 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		{
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
+
+	/*
+	 * Do we have space to report more events that might also be signaled
+	 * later in the array than cur_event?  Being able to return multiple
+	 * socket events at a time like the Unix implementations might be
+	 * important for client code that wants to service busy sockets fairly.
+	 */
+	if (nevents > 0)
+	{
+		int			next_pos = cur_event->pos + 1;
+		int			count = set->nevents - next_pos;
+
+		if (count > 0)
+		{
+			/*
+			 * Poll with zero timeout, and ignore errors now because we
+			 * already have events to report.
+			 */
+			rc = WaitForMultipleObjects(count,
+										set->handles + next_pos + 1,
+										false,
+										0);
+			if (rc >= WAIT_OBJECT_0 && rc < WAIT_OBJECT_0 + count)
+			{
+				cur_event = &set->events[next_pos + rc];
+				goto loop;
+			}
+		}
+	}
 
-- 
2.40.0