Re: Draft for basic NUMA observability

Andres Freund Thu, 27 Mar 2025 07:20:05 -0700

Hi,

On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
>    setup_additional_packages_script: |
> -    #apt-get update
> -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> +    apt-get update
> +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> +      libnuma1 \
> +      libnuma-dev


I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.

> +#
> +# libnuma
> +#
> +AC_MSG_CHECKING([whether to build with libnuma support])
> +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],

Most other dependencies say "build with libxyz ..."


> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *     Basic NUMA portability routines
> + *
> + *
> + * Copyright (c) 2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *   src/include/port/pg_numa.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NUMA_H
> +#define PG_NUMA_H
> +
> +#include "c.h"
> +#include "postgres.h"

Headers should never include either of those headers.  Nor should .c files
include both.


> @@ -200,6 +200,8 @@ pgxs_empty = [
>  
>    'ICU_LIBS',
>  
> +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> +
>    'LIBURING_CFLAGS', 'LIBURING_LIBS',
>  ]

Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?


> +Size
> +pg_numa_get_pagesize(void)
> +{
> +     Size            os_page_size = sysconf(_SC_PAGESIZE);
> +
> +     if (huge_pages_status == HUGE_PAGES_ON)
> +             GetHugePageSize(&os_page_size, NULL);
> +
> +     return os_page_size;
> +}

Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?


> +#ifndef FRONTEND
> +/*
> + * XXX: not really tested as there is no way to trigger this in our
> + * current usage of libnuma.
> + *
> + * The libnuma built-in code can be seen here:
> + * https://github.com/numactl/numactl/blob/master/libnuma.c
> + *
> + */
> +void
> +numa_warn(int num, char *fmt,...)
> +{
> +     va_list         ap;
> +     int                     olde = errno;
> +     int                     needed;
> +     StringInfoData msg;
> +
> +     initStringInfo(&msg);
> +
> +     va_start(ap, fmt);
> +     needed = appendStringInfoVA(&msg, fmt, ap);
> +     va_end(ap);
> +     if (needed > 0)
> +     {
> +             enlargeStringInfo(&msg, needed);
> +             va_start(ap, fmt);
> +             appendStringInfoVA(&msg, fmt, ap);
> +             va_end(ap);
> +     }
> +
> +     ereport(WARNING,
> +                     (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
> +                      errmsg_internal("libnuma: WARNING: %s", msg.data)));

I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.

> +Size
> +pg_numa_get_pagesize(void)
> +{
> +#ifndef WIN32
> +     Size            os_page_size = sysconf(_SC_PAGESIZE);
> +#else
> +     Size            os_page_size;
> +     SYSTEM_INFO sysinfo;
> +
> +     GetSystemInfo(&sysinfo);
> +     os_page_size = sysinfo.dwPageSize;
> +#endif
> +     if (huge_pages_status == HUGE_PAGES_ON)
> +             GetHugePageSize(&os_page_size, NULL);
> +     return os_page_size;
> +}


I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.


> From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <[email protected]>
> Date: Wed, 19 Mar 2025 09:34:56 +0100
> Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
>  primary function to separate functions.
> 
> This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> and get_buffercache_tuple() that help fill result tuplestores based on the
> buffercache contents. This will be used in a follow-up commit that implements
> NUMA observability in pg_buffercache.
> 
> Author: Jakub Wartak <[email protected]>
> Reviewed-by: Bertrand Drouvot <[email protected]>
> Discussion: 
> https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
>  1 file changed, 178 insertions(+), 139 deletions(-)
> 
> diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c 
> b/contrib/pg_buffercache/pg_buffercache_pages.c
> index 62602af1775..86e0b8afe01 100644
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -14,7 +14,6 @@
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
>  
> -
>  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM       8

Independent change.


>  #define NUM_BUFFERCACHE_PAGES_ELEM   9
>  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
>  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
>  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
>  
> -Datum
> -pg_buffercache_pages(PG_FUNCTION_ARGS)
> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * This is almost identical to pg_buffercache_numa_pages(), but this one 
> performs
> + * memory mapping inquiries to display NUMA node information for each buffer.
> + */

If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?



> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * Build buffer cache information for a single buffer.
> + */
> +static void
> +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> +{

This isn't really building a tuple tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.


> +     BufferDesc *bufHdr;
> +     uint32          buf_state;
> +
> +     bufHdr = GetBufferDescriptor(record_id);
> +     /* Lock each buffer header before inspecting. */
> +     buf_state = LockBufHdr(bufHdr);
> +
> +     fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> +     fctx->record[record_id].relfilenumber = 
> BufTagGetRelNumber(&bufHdr->tag);
> +     fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> +     fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> +     fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> +     fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> +     fctx->record[record_id].usagecount = 
> BUF_STATE_GET_USAGECOUNT(buf_state);
> +     fctx->record[record_id].pinning_backends = 
> BUF_STATE_GET_REFCOUNT(buf_state);

As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.



> +static Datum
> +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> +{
> +     Datum           values[NUM_BUFFERCACHE_PAGES_ELEM];
> +     bool            nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> +     HeapTuple       tuple;
> +
> +     values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> +     nulls[0] = false;
> +
> +     /*
> +      * Set all fields except the bufferid to null if the buffer is unused or
> +      * not valid.
> +      */
> +     if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> +             fctx->record[record_id].isvalid == false)
> +     {
> +             nulls[1] = true;
> +             nulls[2] = true;
> +             nulls[3] = true;
> +             nulls[4] = true;
> +             nulls[5] = true;
> +             nulls[6] = true;
> +             nulls[7] = true;
> +
> +             /* unused for v1.0 callers, but the array is always long enough 
> */
> +             nulls[8] = true;

I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.


> +     }
> +     else
> +     {
> +             values[1] = 
> ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> +             nulls[1] = false;
> +             values[2] = 
> ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> +             nulls[2] = false;
> +             values[3] = 
> ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> +             nulls[3] = false;
> +             values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> +             nulls[4] = false;
> +             values[5] = Int64GetDatum((int64) 
> fctx->record[record_id].blocknum);
> +             nulls[5] = false;
> +             values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> +             nulls[6] = false;
> +             values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> +             nulls[7] = false;

Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable.  Unfortunately that'd probably
be best done in one more commit ahead of the rest of the this one...


> @@ -0,0 +1,28 @@
> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif

You could avoid the need for an alternative output file if you instead made
the queries do something like
  SELECT NOT pg_numa_available() OR count(*) ...



> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,42 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the new function.
> +DROP VIEW pg_buffercache;
> +DROP FUNCTION pg_buffercache_pages();

I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.


(Sorry, ran out of time / energy here, i had originally just wanted to comment
on the apt-get thing in the tests)


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

Reply via email to