Here are three small patches that should handle the issue.
0001 - Adds the batching into pg_numa_query_pages, so the callers don't
need to do anything. The batching doesn't seem to cause any performance
regression. 32-bit systems can't use that much memory anyway, and on
64-bit systems the batch is sufficiently large (1024).

0002 - Silences valgrind about the memory touching. It replaces the
macro with a static inline function, and adds suppressions for both
32-bit and 64-bit builds. The 32-bit one may be a bit pointless, because
on my rpi5 valgrind produces a bunch of other reports anyway. But it
doesn't hurt. The function now looks like this:

    static inline void
    pg_numa_touch_mem_if_required(void *ptr)
    {
        volatile uint64 touch pg_attribute_unused();

        touch = *(volatile uint64 *) ptr;
    }

I did a lot of testing on multiple systems to check that replacing the
macro with a static inline function still works - and it seems it does.
But if someone thinks the function won't work, I'd like to know.

0003 - While working on these patches, it occurred to me we could/should
add CHECK_FOR_INTERRUPTS() into the batch loop. This querying can take
quite a bit of time, so letting people interrupt it seems reasonable. It
wasn't possible with just one call into the kernel, but with the
batching we can add a CFI.

Please, take a look.

regards

-- 
Tomas Vondra
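For anyone skimming the thread, the loop in 0001 (with the CFI spot from
0003 marked) boils down to the following standalone sketch. Note this is
an illustration, not the patch itself: stub_query_pages and the fixed
batch size are hypothetical stand-ins for numa_move_pages() and the
SIZEOF_SIZE_T-dependent constant.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins; the real code uses PostgreSQL's Min() and picks
 * the batch size from SIZEOF_SIZE_T (16 on 32-bit, 1024 on 64-bit). */
#define Min(a, b) ((a) < (b) ? (a) : (b))
#define NUMA_QUERY_BATCH_SIZE 16

/* Stub standing in for numa_move_pages(): reports every page on node 0. */
static int
stub_query_pages(unsigned long count, void **pages, int *status)
{
    for (unsigned long i = 0; i < count; i++)
        status[i] = 0;
    return 0;
}

/* Same loop shape as pg_numa_query_pages() after 0001. */
static int
query_pages_batched(unsigned long count, void **pages, int *status)
{
    unsigned long next = 0;

    while (next < count)
    {
        unsigned long count_batch = Min(count - next, NUMA_QUERY_BATCH_SIZE);
        int ret;

        /* 0003 puts CHECK_FOR_INTERRUPTS() here */

        ret = stub_query_pages(count_batch, &pages[next], &status[next]);
        if (ret < 0)
            return ret;         /* bail out on the first failing batch */

        next += count_batch;
    }

    assert(next == count);      /* consumed the input array exactly */
    return 0;
}
```

The point of chunking to 16 on 32-bit builds is that the buggy
do_pages_stat() never sees more pointers per syscall than its own
internal DO_PAGES_STAT_CHUNK_NR, so the broken chunking path is never
taken even on unfixed kernels.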
From 3d935f62665a18d96e6bec59cb1f3f7cd7daa068 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:43:20 +0200
Subject: [PATCH 1/3] Add batching when calling numa_move_pages

There's a kernel bug in do_pages_stat(), resulting in numa_move_pages()
producing bogus status when querying the location of memory pages. The
bug only affects systems combining a 64-bit kernel and 32-bit user
space. This may seem uncommon, but we use such systems for building
32-bit Debian packages (which happens in a 32-bit chroot).

This is a long-standing kernel bug (since 2010), affecting pretty much
all kernels, so it'll take time until all systems get a fixed kernel.
Luckily, we can work around that on our end, by batching the requests
the same way as in do_pages_stat().

On 32-bit systems we use batches of 16 pointers, same as
do_pages_stat(). 64-bit systems are not affected, so we use a much
larger batch of 1024.

Reported-by: Christoph Berg <m...@debian.org>
Author: Christoph Berg <m...@debian.org>
Author: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/include/port/pg_numa.h |  2 +-
 src/port/pg_numa.c         | 45 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..d707d149a43 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -29,7 +29,7 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
 
 #else
 
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+#define pg_numa_touch_mem_if_required(ptr) \
 	do {} while(0)
 
 #endif
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 4b487a2a4e8..54ab9c70d56 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -29,6 +29,19 @@
 #include <numa.h>
 #include <numaif.h>
 
+/*
+ * numa_move_pages() batch size, has to be <= 16 to work around a kernel bug
+ * in do_pages_stat() (chunked by DO_PAGES_STAT_CHUNK_NR). By using the same
+ * batch size, we make it work even on unfixed kernels.
+ *
+ * 64-bit systems are not affected by the bug, and so use much larger batches.
+ */
+#if SIZEOF_SIZE_T == 4
+#define NUMA_QUERY_BATCH_SIZE 16
+#else
+#define NUMA_QUERY_BATCH_SIZE 1024
+#endif
+
 /* libnuma requires initialization as per numa(3) on Linux */
 int
 pg_numa_init(void)
@@ -46,7 +59,37 @@ pg_numa_init(void)
 int
 pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 {
-	return numa_move_pages(pid, count, pages, NULL, status, 0);
+	unsigned long next = 0;
+	int			ret = 0;
+
+	/*
+	 * Batch pointers passed to numa_move_pages to NUMA_QUERY_BATCH_SIZE
+	 * items, to work around a kernel bug in do_pages_stat().
+	 */
+	while (next < count)
+	{
+		unsigned long count_batch = Min(count - next,
+										NUMA_QUERY_BATCH_SIZE);
+
+		/*
+		 * Bail out if any of the batches errors out (ret<0). We ignore
+		 * (ret>0) which is used to return number of nonmigrated pages,
+		 * but we're not migrating any pages here.
+		 */
+		ret = numa_move_pages(pid, count_batch, &pages[next], NULL, &status[next], 0);
+		if (ret < 0)
+		{
+			/* plain error, return as is */
+			return ret;
+		}
+
+		next += count_batch;
+	}
+
+	/* should have consumed the input array exactly */
+	Assert(next == count);
+
+	return 0;
 }
 
 int
-- 
2.49.0
From 613a85c50a17574fcd34689582ce00c879187463 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:47:38 +0200
Subject: [PATCH 2/3] Silence valgrind about pg_numa_touch_mem_if_required

When querying NUMA status of pages in shared memory, we need to touch
the memory first to get valid results. This may trigger valgrind
reports, because some of the memory (e.g. unpinned buffers) may be
marked as noaccess. Solved by adding a valgrind suppression, to ignore
this.

An alternative would be to adjust the access/noaccess status before
touching the memory, but that seems far too invasive - it would require
all those places to have detailed knowledge of what the memory stores.

The pg_numa_touch_mem_if_required() macro is replaced with a plain
function. Macros are invisible to suppressions, so we'd have to
suppress reports for the caller - e.g. pg_get_shmem_allocations_numa()
- which means suppressing reports for the whole function.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 contrib/pg_buffercache/pg_buffercache_pages.c |  3 +--
 src/backend/storage/ipc/shmem.c               |  4 +---
 src/include/port/pg_numa.h                    |  8 ++++++--
 src/tools/valgrind.supp                       | 14 ++++++++++++++
 4 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 4b007f6e1b0..ae0291e6e96 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -320,7 +320,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		uint64		os_page_count;
 		int			pages_per_buffer;
 		int			max_entries;
-		volatile uint64 touch pg_attribute_unused();
 		char	   *startptr,
 				   *endptr;
 
@@ -375,7 +374,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 
 			/* Only need to touch memory once per backend process lifetime */
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, ptr);
+				pg_numa_touch_mem_if_required(ptr);
 		}
 
 		Assert(idx == os_page_count);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..ca3656fc76f 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -679,12 +679,10 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 		 */
 		for (i = 0; i < shm_ent_page_count; i++)
 		{
-			volatile uint64 touch pg_attribute_unused();
-
 			page_ptrs[i] = startptr + (i * os_page_size);
 
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+				pg_numa_touch_mem_if_required(page_ptrs[i]);
 
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index d707d149a43..6c8b7103cc3 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -24,8 +24,12 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
  * This is required on Linux, before pg_numa_query_pages() as we
  * need to page-fault before move_pages(2) syscall returns valid results.
  */
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
-	ro_volatile_var = *(volatile uint64 *) ptr
+static inline void
+pg_numa_touch_mem_if_required(void *ptr)
+{
+	volatile uint64 touch pg_attribute_unused();
+
+	touch = *(volatile uint64 *) ptr;
+}
 
 #else
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..2ad5b81526d 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,17 @@
 	Memcheck:Cond
 	fun:PyObject_Realloc
 }
+
+# NUMA introspection requires touching memory first, and some of it may
+# be marked as noaccess (e.g. unpinned buffers). So just ignore that.
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr4
+	fun:pg_numa_touch_mem_if_required
+}
+
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr8
+	fun:pg_numa_touch_mem_if_required
+}
-- 
2.49.0
From add3768156d05382b2de1dd64d8c420e8c50f92b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 16:42:43 +0200
Subject: [PATCH 3/3] Add CHECK_FOR_INTERRUPTS into pg_numa_query_pages

Querying the NUMA status can be quite time consuming. Thanks to the
batching, we can do CHECK_FOR_INTERRUPTS(), allowing users to abort
the execution.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/port/pg_numa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 54ab9c70d56..f76876b2906 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -16,6 +16,7 @@
 #include "c.h"
 
 #include <unistd.h>
+#include "miscadmin.h"
 #include "port/pg_numa.h"
 
 /*
@@ -71,6 +72,8 @@ pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 		unsigned long count_batch = Min(count - next,
 										NUMA_QUERY_BATCH_SIZE);
 
+		CHECK_FOR_INTERRUPTS();
+
 		/*
 		 * Bail out if any of the batches errors out (ret<0). We ignore
 		 * (ret>0) which is used to return number of nonmigrated pages,
-- 
2.49.0