On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <and...@anarazel.de> wrote:
> > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <and...@anarazel.de> wrote:
> > > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > > interrupt checks.

Here's a draft patch that tries to explain all this in the commit
message and comments.

Even if we go with this approach now, I think it's plausible that we
might want to reconsider this yet again one day, perhaps allocating
memory with some future asynchronous infrastructure while still
processing interrupts.  It's not very nice to hold up recovery or
ProcSignalBarrier for long operations.

I'm a little unclear about ftruncate() here.  I don't expect it to
report EINTR in other places where we use it (ie to make a local file
on a non-"slow device" smaller), because I expect that to be like
read(), write() etc which we don't wrap in EINTR loops.  Here you've
observed EINTR when messing around with a debugger*.  It seems
inconsistent to put posix_fallocate() in an EINTR retry loop for the
benefit of debuggers, but not ftruncate().  But perhaps that's good
enough, on the theory that posix_fallocate(1GB) is a very large target
and you have a decent chance of hitting it.

Another observation while staring at that ftruncate():  It's entirely
redundant on Linux, because we only ever call dsm_impl_posix_resize()
to make the segment bigger.  Before commit 3c60d0fa (v12) it was
possible to resize a segment to be smaller with dsm_resize(), so you
needed one or t'other depending on the requested size and we just
called both, but dsm_resize() wasn't ever used AFAIK and didn't even
work on all DSM implementations, among other problems, so we ripped it
out.  So... on master at least, we could also change the #ifdef to be
either-or.  While refactoring like that, I think we might as well also
rearrange the code so that the wait event is reported also for other
OSes, just in case it takes a long time.  See 0002 patch.

*It's funny that ftruncate() apparently doesn't automatically restart
for ptrace SIGCONT on Linux according to your report, while poll()
does according to my experiments, even though the latter is one of the
never-restart functions (it doesn't on other OSes I hack on, and you
feel the difference when debugging missing wakeup type bugs...).
Random implementation details...
From d22c937914906135fddc8635792641089ba3e2cd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 7 Jul 2022 15:15:11 +1200
Subject: [PATCH 1/2] Block signals while allocating DSM memory.

On Linux, we call posix_fallocate() on shm_open()'d memory to avoid
later potential SIGBUS (see commit 899bd785).

Based on field reports of systems stuck in an EINTR retry loop there,
there, we made it possible to break out of that loop via slightly odd
coding where the CHECK_FOR_INTERRUPTS() call was somewhat removed from
the loop (see commit 422952ee).

On further reflection, that was not a great choice for at least two
reasons:

1.  If interrupts were held, the CHECK_FOR_INTERRUPTS() would do nothing
and the EINTR error would be surfaced to the user.

2.  If EINTR was reported but neither QueryCancelPending nor
ProcDiePending was set, then we'd dutifully retry, but with a bit more
understanding of how posix_fallocate() works, it's now clear that you
can get into a loop that never terminates.  posix_fallocate() is not a
function that can do some of the job and tell you about progress if it's
interrupted, it has to undo what it's done so far and report EINTR, and
if signals keep arriving faster than it can complete (cf recovery
conflict signals), you're stuck.

Therefore, for now, we'll simply block most signals to guarantee
progress.  SIGQUIT is not blocked (see InitPostmasterChild()), because
its expected handler doesn't return, and unblockable signals like
SIGCONT are not expected to arrive at a high rate.  For good measure,
we'll include the ftruncate() call in the blocked region.

Back-patch to all supported releases.

Reported-by: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Reported-by: Nicola Contu <nicola.co...@gmail.com>
Discussion: https://postgr.es/m/20220701154105.jjfutmngoedgiad3%40alvherre.pgsql
---
 src/backend/storage/ipc/dsm_impl.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 323862a3d2..aa47e5de70 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -62,6 +62,7 @@
 #endif
 
 #include "common/file_perm.h"
+#include "libpq/pqsignal.h"		/* for PG_SETMASK macro */
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "portability/mem.h"
@@ -306,14 +307,6 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 		shm_unlink(name);
 		errno = save_errno;
 
-		/*
-		 * If we received a query cancel or termination signal, we will have
-		 * EINTR set here.  If the caller said that errors are OK here, check
-		 * for interrupts immediately.
-		 */
-		if (errno == EINTR && elevel >= ERROR)
-			CHECK_FOR_INTERRUPTS();
-
 		ereport(elevel,
 				(errcode_for_dynamic_shared_memory(),
 				 errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
@@ -362,6 +355,14 @@ dsm_impl_posix_resize(int fd, off_t size)
 {
 	int			rc;
 
+	/*
+	 * Block all blockable signals, except SIGQUIT.  posix_fallocate() can run
+	 * for quite a long time, and is an all-or-nothing operation.  If we
+	 * allowed SIGUSR1 to interrupt us repeatedly (for example, due to recovery
+	 * conflicts), the retry loop might never succeed.
+	 */
+	PG_SETMASK(&BlockSig);
+
 	/* Truncate (or extend) the file to the requested size. */
 	rc = ftruncate(fd, size);
 
@@ -377,15 +378,15 @@ dsm_impl_posix_resize(int fd, off_t size)
 	if (rc == 0)
 	{
 		/*
-		 * We may get interrupted.  If so, just retry unless there is an
-		 * interrupt pending.  This avoids the possibility of looping forever
-		 * if another backend is repeatedly trying to interrupt us.
+		 * We still use a traditional EINTR retry loop to handle SIGCONT.
+		 * posix_fallocate() doesn't restart automatically, and we don't want
+		 * this to fail if you attach a debugger.
 		 */
 		pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
 		do
 		{
 			rc = posix_fallocate(fd, 0, size);
-		} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
+		} while (rc == EINTR);
 		pgstat_report_wait_end();
 
 		/*
@@ -397,6 +398,8 @@ dsm_impl_posix_resize(int fd, off_t size)
 	}
 #endif							/* HAVE_POSIX_FALLOCATE && __linux__ */
 
+	PG_SETMASK(&UnBlockSig);
+
 	return rc;
 }
 
-- 
2.30.2

From 637839f50bc77a550acaa49385e6a7b6a18f8eb0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 7 Jul 2022 16:28:01 +1200
Subject: [PATCH 2/2] Remove redundant ftruncate() when allocating memory.

In early releases of the DSM infrastructure, it was possible to resize
segments.  That was removed in release 12 by commit 3c60d0fa.  Now the
ftruncate() + posix_fallocate() sequence during DSM segment creation is
has a redundant step: we're always extending from zero to the desired
size, so we might as well just call posix_fallocate().

Let's also include the remaining ftruncate() call (non-Linux POSIX
systems) in the wait event reporting, for good measure.
---
 src/backend/storage/ipc/dsm_impl.c | 48 ++++++++++++++----------------
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aa47e5de70..4a7492b37c 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -363,40 +363,36 @@ dsm_impl_posix_resize(int fd, off_t size)
 	 */
 	PG_SETMASK(&BlockSig);
 
-	/* Truncate (or extend) the file to the requested size. */
-	rc = ftruncate(fd, size);
-
+	pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
 	/*
-	 * On Linux, a shm_open fd is backed by a tmpfs file.  After resizing with
-	 * ftruncate, the file may contain a hole.  Accessing memory backed by a
+	 * On Linux, a shm_open fd is backed by a tmpfs file.  If we were to use
+	 * ftruncate, the file would contain a hole.  Accessing memory backed by a
 	 * hole causes tmpfs to allocate pages, which fails with SIGBUS if there
 	 * is no more tmpfs space available.  So we ask tmpfs to allocate pages
 	 * here, so we can fail gracefully with ENOSPC now rather than risking
 	 * SIGBUS later.
+	 *
+	 * We still use a traditional EINTR retry loop to handle SIGCONT.
+	 * posix_fallocate() doesn't restart automatically, and we don't want
+	 * this to fail if you attach a debugger.
 	 */
-#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
-	if (rc == 0)
+	do
 	{
-		/*
-		 * We still use a traditional EINTR retry loop to handle SIGCONT.
-		 * posix_fallocate() doesn't restart automatically, and we don't want
-		 * this to fail if you attach a debugger.
-		 */
-		pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
-		do
-		{
-			rc = posix_fallocate(fd, 0, size);
-		} while (rc == EINTR);
-		pgstat_report_wait_end();
+		rc = posix_fallocate(fd, 0, size);
+	} while (rc == EINTR);
 
-		/*
-		 * The caller expects errno to be set, but posix_fallocate() doesn't
-		 * set it.  Instead it returns error numbers directly.  So set errno,
-		 * even though we'll also return rc to indicate success or failure.
-		 */
-		errno = rc;
-	}
-#endif							/* HAVE_POSIX_FALLOCATE && __linux__ */
+	/*
+	 * The caller expects errno to be set, but posix_fallocate() doesn't
+	 * set it.  Instead it returns error numbers directly.  So set errno,
+	 * even though we'll also return rc to indicate success or failure.
+	 */
+	errno = rc;
+#else
+	/* Extend the file to the requested size. */
+	rc = ftruncate(fd, size);
+#endif
+	pgstat_report_wait_end();
 
 	PG_SETMASK(&UnBlockSig);
 
-- 
2.30.2

Reply via email to