On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <and...@anarazel.de> wrote: > > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote: > > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <and...@anarazel.de> wrote: > > > > So I think we need: 1) block most signals, 2) a retry loop *without* > > > > interrupt checks.
Here's a draft patch that tries to explain all this in the commit message and comments. Even if we go with this approach now, I think it's plausible that we might want to reconsider this yet again one day, perhaps allocating memory with some future asynchronous infrastructure while still processing interrupts. It's not very nice to hold up recovery or ProcSignalBarrier for long operations. I'm a little unclear about ftruncate() here. I don't expect it to report EINTR in other places where we use it (ie to make a local file on a non-"slow device" smaller), because I expect that to be like read(), write() etc which we don't wrap in EINTR loops. Here you've observed EINTR when messing around with a debugger*. It seems inconsistent to put posix_fallocate() in an EINTR retry loop for the benefit of debuggers, but not ftruncate(). But perhaps that's good enough, on the theory that posix_fallocate(1GB) is a very large target and you have a decent chance of hitting it. Another observation while staring at that ftruncate(): It's entirely redundant on Linux, because we only ever call dsm_impl_posix_resize() to make the segment bigger. Before commit 3c60d0fa (v12) it was possible to resize a segment to be smaller with dsm_resize(), so you needed one or t'other depending on the requested size and we just called both, but dsm_resize() wasn't ever used AFAIK and didn't even work on all DSM implementations, among other problems, so we ripped it out. So... on master at least, we could also change the #ifdef to be either-or. While refactoring like that, I think we might as well also rearrange the code so that the wait event is reported also for other OSes, just in case it takes a long time. See 0002 patch. *It's funny that ftruncate() apparently doesn't automatically restart for ptrace SIGCONT on Linux according to your report, while poll() does according to my experiments, even though the latter is one of the never-restart functions (it doesn't on other OSes I hack on, and you feel the difference when debugging missing wakeup type bugs...). Random implementation details...
From d22c937914906135fddc8635792641089ba3e2cd Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Thu, 7 Jul 2022 15:15:11 +1200 Subject: [PATCH 1/2] Block signals while allocating DSM memory. On Linux, we call posix_fallocate() on shm_open()'d memory to avoid later potential SIGBUS (see commit 899bd785). Based on field reports of systems stuck in an EINTR retry loop there, there, we made it possible to break out of that loop via slightly odd coding where the CHECK_FOR_INTERRUPTS() call was somewhat removed from the loop (see commit 422952ee). On further reflection, that was not a great choice for at least two reasons: 1. If interrupts were held, the CHECK_FOR_INTERRUPTS() would do nothing and the EINTR error would be surfaced to the user. 2. If EINTR was reported but neither QueryCancelPending nor ProcDiePending was set, then we'd dutifully retry, but with a bit more understanding of how posix_fallocate() works, it's now clear that you can get into a loop that never terminates. posix_fallocate() is not a function that can do some of the job and tell you about progress if it's interrupted, it has to undo what it's done so far and report EINTR, and if signals keep arriving faster than it can complete (cf recovery conflict signals), you're stuck. Therefore, for now, we'll simply block most signals to guarantee progress. SIGQUIT is not blocked (see InitPostmasterChild()), because its expected handler doesn't return, and unblockable signals like SIGCONT are not expected to arrive at a high rate. For good measure, we'll include the ftruncate() call in the blocked region. Back-patch to all supported releases. Reported-by: Alvaro Herrera <alvhe...@alvh.no-ip.org> Reported-by: Nicola Contu <nicola.co...@gmail.com> Discussion: https://postgr.es/m/20220701154105.jjfutmngoedgiad3%40alvherre.pgsql --- src/backend/storage/ipc/dsm_impl.c | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 323862a3d2..aa47e5de70 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -62,6 +62,7 @@ #endif #include "common/file_perm.h" +#include "libpq/pqsignal.h" /* for PG_SETMASK macro */ #include "miscadmin.h" #include "pgstat.h" #include "portability/mem.h" @@ -306,14 +307,6 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, shm_unlink(name); errno = save_errno; - /* - * If we received a query cancel or termination signal, we will have - * EINTR set here. If the caller said that errors are OK here, check - * for interrupts immediately. - */ - if (errno == EINTR && elevel >= ERROR) - CHECK_FOR_INTERRUPTS(); - ereport(elevel, (errcode_for_dynamic_shared_memory(), errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m", @@ -362,6 +355,14 @@ dsm_impl_posix_resize(int fd, off_t size) { int rc; + /* + * Block all blockable signals, except SIGQUIT. posix_fallocate() can run + * for quite a long time, and is an all-or-nothing operation. If we + * allowed SIGUSR1 to interrupt us repeatedly (for example, due to recovery + * conflicts), the retry loop might never succeed. + */ + PG_SETMASK(&BlockSig); + /* Truncate (or extend) the file to the requested size. */ rc = ftruncate(fd, size); @@ -377,15 +378,15 @@ dsm_impl_posix_resize(int fd, off_t size) if (rc == 0) { /* - * We may get interrupted. If so, just retry unless there is an - * interrupt pending. This avoids the possibility of looping forever - * if another backend is repeatedly trying to interrupt us. + * We still use a traditional EINTR retry loop to handle SIGCONT. + * posix_fallocate() doesn't restart automatically, and we don't want + * this to fail if you attach a debugger. */ pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE); do { rc = posix_fallocate(fd, 0, size); - } while (rc == EINTR && !(ProcDiePending || QueryCancelPending)); + } while (rc == EINTR); pgstat_report_wait_end(); /* @@ -397,6 +398,8 @@ dsm_impl_posix_resize(int fd, off_t size) } #endif /* HAVE_POSIX_FALLOCATE && __linux__ */ + PG_SETMASK(&UnBlockSig); + return rc; } -- 2.30.2
From 637839f50bc77a550acaa49385e6a7b6a18f8eb0 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Thu, 7 Jul 2022 16:28:01 +1200 Subject: [PATCH 2/2] Remove redundant ftruncate() when allocating memory. In early releases of the DSM infrastructure, it was possible to resize segments. That was removed in release 12 by commit 3c60d0fa. Now the ftruncate() + posix_fallocate() sequence during DSM segment creation is has a redundant step: we're always extending from zero to the desired size, so we might as well just call posix_fallocate(). Let's also include the remaining ftruncate() call (non-Linux POSIX systems) in the wait event reporting, for good measure. --- src/backend/storage/ipc/dsm_impl.c | 48 ++++++++++++++---------------- 1 file changed, 22 insertions(+), 26 deletions(-) diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index aa47e5de70..4a7492b37c 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -363,40 +363,36 @@ dsm_impl_posix_resize(int fd, off_t size) */ PG_SETMASK(&BlockSig); - /* Truncate (or extend) the file to the requested size. */ - rc = ftruncate(fd, size); - + pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE); +#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__) /* - * On Linux, a shm_open fd is backed by a tmpfs file. After resizing with - * ftruncate, the file may contain a hole. Accessing memory backed by a + * On Linux, a shm_open fd is backed by a tmpfs file. If we were to use + * ftruncate, the file would contain a hole. Accessing memory backed by a * hole causes tmpfs to allocate pages, which fails with SIGBUS if there * is no more tmpfs space available. So we ask tmpfs to allocate pages * here, so we can fail gracefully with ENOSPC now rather than risking * SIGBUS later. + * + * We still use a traditional EINTR retry loop to handle SIGCONT. + * posix_fallocate() doesn't restart automatically, and we don't want + * this to fail if you attach a debugger. */ -#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__) - if (rc == 0) + do { - /* - * We still use a traditional EINTR retry loop to handle SIGCONT. - * posix_fallocate() doesn't restart automatically, and we don't want - * this to fail if you attach a debugger. - */ - pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE); - do - { - rc = posix_fallocate(fd, 0, size); - } while (rc == EINTR); - pgstat_report_wait_end(); + rc = posix_fallocate(fd, 0, size); + } while (rc == EINTR); - /* - * The caller expects errno to be set, but posix_fallocate() doesn't - * set it. Instead it returns error numbers directly. So set errno, - * even though we'll also return rc to indicate success or failure. - */ - errno = rc; - } -#endif /* HAVE_POSIX_FALLOCATE && __linux__ */ + /* + * The caller expects errno to be set, but posix_fallocate() doesn't + * set it. Instead it returns error numbers directly. So set errno, + * even though we'll also return rc to indicate success or failure. + */ + errno = rc; +#else + /* Extend the file to the requested size. */ + rc = ftruncate(fd, size); +#endif + pgstat_report_wait_end(); PG_SETMASK(&UnBlockSig); -- 2.30.2