Here are three small patches that should handle the issue.
0001 - Adds the batching into pg_numa_query_pages, so the callers don't
need to do anything. The batching doesn't seem to cause any performance
regression. 32-bit systems can't use that much memory anyway, and on
64-bit systems the batch is sufficiently large (1024).

0002 - Silences valgrind about the memory touching. It replaces the
macro with a static inline function, and adds suppressions for both
32-bit and 64-bit builds. The 32-bit one may be a bit pointless, because
on my rpi5 valgrind produces a bunch of other reports anyway. But it
doesn't hurt. The function now looks like this:

    static inline void
    pg_numa_touch_mem_if_required(void *ptr)
    {
        volatile uint64 touch pg_attribute_unused();

        touch = *(volatile uint64 *) ptr;
    }

I did a lot of testing on multiple systems to check that replacing the
macro with a static inline function still works - and it seems it does.
But if someone thinks the function won't work, I'd like to know.

0003 - While working on these patches, it occurred to me we could/should
add CHECK_FOR_INTERRUPTS() into the batch loop. This querying can take
quite a bit of time, so letting people interrupt it seems reasonable. It
wasn't possible with just one call into the kernel, but with the
batching we can add a CFI.

Please, take a look.

regards

-- 
Tomas Vondra
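For anyone skimming the thread, the loop in 0001 (with the CFI spot from
0003 marked) boils down to the following standalone sketch. Note this is
an illustration, not the patch itself: stub_query_pages and the fixed
batch size are hypothetical stand-ins for numa_move_pages() and the
SIZEOF_SIZE_T-dependent constant.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins; the real code uses PostgreSQL's Min() and picks
 * the batch size from SIZEOF_SIZE_T (16 on 32-bit, 1024 on 64-bit). */
#define Min(a, b) ((a) < (b) ? (a) : (b))
#define NUMA_QUERY_BATCH_SIZE 16

/* Stub standing in for numa_move_pages(): reports every page on node 0. */
static int
stub_query_pages(unsigned long count, void **pages, int *status)
{
    for (unsigned long i = 0; i < count; i++)
        status[i] = 0;
    return 0;
}

/* Same loop shape as pg_numa_query_pages() after 0001. */
static int
query_pages_batched(unsigned long count, void **pages, int *status)
{
    unsigned long next = 0;

    while (next < count)
    {
        unsigned long count_batch = Min(count - next, NUMA_QUERY_BATCH_SIZE);
        int ret;

        /* 0003 puts CHECK_FOR_INTERRUPTS() here */

        ret = stub_query_pages(count_batch, &pages[next], &status[next]);
        if (ret < 0)
            return ret;         /* bail out on the first failing batch */

        next += count_batch;
    }

    assert(next == count);      /* consumed the input array exactly */
    return 0;
}
```

The point of chunking to 16 on 32-bit builds is that the buggy
do_pages_stat() never sees more pointers per syscall than its own
internal DO_PAGES_STAT_CHUNK_NR, so the broken chunking path is never
taken even on unfixed kernels.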
From 3d935f62665a18d96e6bec59cb1f3f7cd7daa068 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:43:20 +0200
Subject: [PATCH 1/3] Add batching when calling numa_move_pages

There's a kernel bug in do_pages_stat(), resulting in numa_move_pages()
producing bogus status when querying the location of memory pages. The
bug only affects systems combining a 64-bit kernel and 32-bit user
space. This may seem uncommon, but we use such systems for building
32-bit Debian packages (which happens in a 32-bit chroot).

This is a long-standing kernel bug (since 2010), affecting pretty much
all kernels, so it'll take time until all systems get a fixed kernel.
Luckily, we can work around that on our end, by batching the requests
the same way as in do_pages_stat().

On 32-bit systems we use batches of 16 pointers, same as
do_pages_stat(). 64-bit systems are not affected, so we use a much
larger batch of 1024.

Reported-by: Christoph Berg <m...@debian.org>
Author: Christoph Berg <m...@debian.org>
Author: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/include/port/pg_numa.h |  2 +-
 src/port/pg_numa.c         | 45 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..d707d149a43 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -29,7 +29,7 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
 
 #else
 
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+#define pg_numa_touch_mem_if_required(ptr) \
 	do {} while(0)
 
 #endif
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 4b487a2a4e8..54ab9c70d56 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -29,6 +29,19 @@
 #include <numa.h>
 #include <numaif.h>
 
+/*
+ * numa_move_pages() batch size, has to be <= 16 to work around a kernel bug
+ * in do_pages_stat() (chunked by DO_PAGES_STAT_CHUNK_NR). By using the same
+ * batch size, we make it work even on unfixed kernels.
+ *
+ * 64-bit systems are not affected by the bug, and so use much larger batches.
+ */
+#if SIZEOF_SIZE_T == 4
+#define NUMA_QUERY_BATCH_SIZE 16
+#else
+#define NUMA_QUERY_BATCH_SIZE 1024
+#endif
+
 /* libnuma requires initialization as per numa(3) on Linux */
 int
 pg_numa_init(void)
@@ -46,7 +59,37 @@ pg_numa_init(void)
 int
 pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 {
-	return numa_move_pages(pid, count, pages, NULL, status, 0);
+	unsigned long next = 0;
+	int			ret = 0;
+
+	/*
+	 * Batch pointers passed to numa_move_pages to NUMA_QUERY_BATCH_SIZE
+	 * items, to work around a kernel bug in do_pages_stat().
+	 */
+	while (next < count)
+	{
+		unsigned long count_batch = Min(count - next,
+										NUMA_QUERY_BATCH_SIZE);
+
+		/*
+		 * Bail out if any of the batches errors out (ret<0). We ignore
+		 * (ret>0) which is used to return number of nonmigrated pages,
+		 * but we're not migrating any pages here.
+		 */
+		ret = numa_move_pages(pid, count_batch, &pages[next], NULL, &status[next], 0);
+		if (ret < 0)
+		{
+			/* plain error, return as is */
+			return ret;
+		}
+
+		next += count_batch;
+	}
+
+	/* should have consumed the input array exactly */
+	Assert(next == count);
+
+	return 0;
 }
 
 int
-- 
2.49.0
From 613a85c50a17574fcd34689582ce00c879187463 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:47:38 +0200
Subject: [PATCH 2/3] Silence valgrind about pg_numa_touch_mem_if_required

When querying NUMA status of pages in shared memory, we need to touch
the memory first to get valid results. This may trigger valgrind
reports, because some of the memory (e.g. unpinned buffers) may be
marked as noaccess. Solved by adding a valgrind suppression, to ignore
this.

An alternative would be to adjust the access/noaccess status before
touching the memory, but that seems far too invasive - it would require
all those places to have detailed knowledge of what the memory stores.

The pg_numa_touch_mem_if_required() macro is replaced with a plain
function. Macros are invisible to suppressions, so we'd have to
suppress reports for the caller - e.g. pg_get_shmem_allocations_numa()
- which means suppressing reports for the whole function.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 contrib/pg_buffercache/pg_buffercache_pages.c |  3 +--
 src/backend/storage/ipc/shmem.c               |  4 +---
 src/include/port/pg_numa.h                    |  8 ++++++--
 src/tools/valgrind.supp                       | 14 ++++++++++++++
 4 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 4b007f6e1b0..ae0291e6e96 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -320,7 +320,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		uint64		os_page_count;
 		int			pages_per_buffer;
 		int			max_entries;
-		volatile uint64 touch pg_attribute_unused();
 		char	   *startptr,
 				   *endptr;
 
@@ -375,7 +374,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 
 			/* Only need to touch memory once per backend process lifetime */
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, ptr);
+				pg_numa_touch_mem_if_required(ptr);
 		}
 
 		Assert(idx == os_page_count);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..ca3656fc76f 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -679,12 +679,10 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 		 */
 		for (i = 0; i < shm_ent_page_count; i++)
 		{
-			volatile uint64 touch pg_attribute_unused();
-
 			page_ptrs[i] = startptr + (i * os_page_size);
 
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+				pg_numa_touch_mem_if_required(page_ptrs[i]);
 
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index d707d149a43..6c8b7103cc3 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -24,8 +24,12 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
  * This is required on Linux, before pg_numa_query_pages() as we
  * need to page-fault before move_pages(2) syscall returns valid results.
  */
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
-	ro_volatile_var = *(volatile uint64 *) ptr
+static inline void
+pg_numa_touch_mem_if_required(void *ptr)
+{
+	volatile uint64 touch pg_attribute_unused();
+
+	touch = *(volatile uint64 *) ptr;
+}
 
 #else
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..2ad5b81526d 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,17 @@
 	Memcheck:Cond
 	fun:PyObject_Realloc
 }
+
+# NUMA introspection requires touching memory first, and some of it may
+# be marked as noaccess (e.g. unpinned buffers). So just ignore that.
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr4
+	fun:pg_numa_touch_mem_if_required
+}
+
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr8
+	fun:pg_numa_touch_mem_if_required
+}
-- 
2.49.0
From add3768156d05382b2de1dd64d8c420e8c50f92b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 16:42:43 +0200
Subject: [PATCH 3/3] Add CHECK_FOR_INTERRUPTS into pg_numa_query_pages

Querying the NUMA status can be quite time consuming. Thanks to the
batching, we can do CHECK_FOR_INTERRUPTS(), allowing users to abort
the execution.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/port/pg_numa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 54ab9c70d56..f76876b2906 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -16,6 +16,7 @@
 #include "c.h"
 
 #include <unistd.h>
+#include "miscadmin.h"
 #include "port/pg_numa.h"
 
 /*
@@ -71,6 +72,8 @@ pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 		unsigned long count_batch = Min(count - next,
 										NUMA_QUERY_BATCH_SIZE);
 
+		CHECK_FOR_INTERRUPTS();
+
 		/*
 		 * Bail out if any of the batches errors out (ret<0). We ignore
 		 * (ret>0) which is used to return number of nonmigrated pages,
-- 
2.49.0