On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
<bertranddrouvot...@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> > On 4/3/25 15:12, Jakub Wartak wrote:
> > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <to...@vondra.me> wrote:
> > >
> > >> ...
> > >>
> > >> So unless someone can demonstrate a use case where this would matter,
> > >> I'd not worry about it too much.
> > >
> > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > > it's just that I don't have cycles left today and probably lack skills
> > > (i've never dealt with arrays so far) thus it would be slow to get it
> > > right... but I can pick up anything tomorrow morning.
> > >
> >
> > OK, I took a stab at reworking/simplifying this the way I proposed.
> > Here's v24 - needs more polishing, but hopefully enough to show what I
> > had in mind.
> >
> > It does these changes:
> >
> > 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> > is not "extending" the existing one.
>
> I think that makes sense. One would just need to join on the pg_buffercache
> view to get more information about the buffer if needed.
>
> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
> information as <function>pg_buffercache_pages()</function>" is still true.
>
> > 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> > page_num and node_id. page_num is a sequence starting from 0 for each
> > buffer.
>
> +1 on the idea
>
> > 3) It now builds an array of records, with one record per buffer/page.
> >
> > 4) I realized we don't really need to worry about buffers_per_page very
> > much, except for logging/debugging. There's always "at least one page"
> > per buffer, even if an incomplete one, so we can do this:
> >
> >    os_page_count = NBuffers * Max(1, pages_per_buffer);
> >
> > and then
> >
> >   for (i = 0; i < NBuffers; i++)
> >   {
> >       for (j = 0; j < Max(1, pages_per_buffer); j++)
>
> That's a nice simplification as we always need to take care of at least one page
> per buffer.
>
> > and everything just works fine, I think.
>
> I think the same.
>
> > Opinions? I personally find this much cleaner / easier to understand.
>
> I agree that's easier to understand and that that looks correct.
>
> A few random comments:
>
> === 1
>
> It looks like that firstNumaTouch is not set to false anymore.
>
> === 2
>
> +               pfree(os_page_status);
> +               pfree(os_page_ptrs);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).
>
> === 3
>
> +       ro_volatile_var = *(uint64 *)ptr
>
> space missing before "ptr"?
>

Adding my feedback, as I've noticed that Bertrand already provided a review.

Right, the code is now simple, and that Max() is brilliant. I've
attached some review findings as .txt

0001 100% LGTM
0002 doc fix + pgindent + Tomas, you should take Authored-by yourself
there for sure, I couldn't pull this off alone in time! So big thanks!
0003 fixes elog UINT64_FORMAT for mingw32 (a little bit funny to have
NUMA on mingw32... :))

When started with interleave=all on serious hardware, I'm getting the
following from pg_buffercache_numa (~5s for s_b=64GB):

     node_id |  count
    ---------+---------
           3 | 2097152
           0 | 2097152
           2 | 2097152
           1 | 2097152

so this is a valid result (2097152 buffers * 4 NUMA nodes * 8192 buffer
size / 1024/1024/1024 = 64GB).

Also with pgbench -i -s 20, after ~8s:
    select c.relname, n.node_id, count(*) from pg_buffercache_numa n
    join pg_buffercache b on (b.bufferid = n.bufferid)
    join pg_class c on (c.relfilenode = b.relfilenode)
    group by c.relname, n.node_id order by count(*) desc;
                        relname                    | node_id | count
    -----------------------------------------------+---------+-------
     pgbench_accounts                              |       2 |  8217
     pgbench_accounts                              |       0 |  8190
     pgbench_accounts                              |       3 |  8189
     pgbench_accounts                              |       1 |  8187
     pg_statistic                                  |       2 |    32
     pg_operator                                   |       2 |    14
     pg_depend                                     |       3 |    14
     [..]

pg_shm_allocations_numa also looks good.

I think v24+tiny fixes is good enough to go in.

-J.
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 59dbbd2b25e..3d9032efafb 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -45,9 +45,10 @@
  </para>
 
  <para>
-  The <function>pg_buffercache_numa_pages()</function> provides the same information
-  as <function>pg_buffercache_pages()</function> but is slower because it also
-  provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+  The <function>pg_buffercache_numa_pages()</function> provides
+  <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+  information is not part of <function>pg_buffercache_pages()</function>
+  itself, as it is much slower to retrieve.
   The <structname>pg_buffercache_numa</structname> view wraps the function for
   convenient use.
  </para>
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index d653f4af394..5526dee7171 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -67,7 +67,7 @@ typedef struct
        uint32          bufferid;
        int32           numa_page;
        int32           numa_node;
-} BufferCacheNumaRec;
+}                      BufferCacheNumaRec;
 
 /*
  * Function context for data persisting over repeated calls.
@@ -79,7 +79,7 @@ typedef struct
        int                     pages_per_buffer;
        int                     os_page_size;
        BufferCacheNumaRec *record;
-} BufferCacheNumaContext;
+}                      BufferCacheNumaContext;
 
 
 /*
@@ -454,7 +454,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
                         */
                        for (j = 0; j < Max(1, pages_per_buffer); j++)
                        {
-                               char *buffptr = (char *) BufferGetBlock(i + 1);
+                               char       *buffptr = (char *) BufferGetBlock(i + 1);
 
                                fctx->record[idx].bufferid = bufferid;
                                fctx->record[idx].numa_page = j;
@@ -480,8 +480,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
                        elog(ERROR, "failed NUMA pages inquiry: %m");
 
                /*
-                * Update the entries with NUMA node ID. The status array is indexed the
-                * same way as the entry index.
+                * Update the entries with NUMA node ID. The status array is indexed
+                * the same way as the entry index.
                 */
                for (i = 0; i < os_page_count; i++)
                {
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 4313e6db62c..e26af975a7d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -677,7 +677,7 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
                        if (s >= 0 && s <= max_nodes)
                                nodes[s]++;
                        else
-                               elog(ERROR, "invalid NUMA node id outside of allowed range [0, %ld]: %d", max_nodes, s);
+                               elog(ERROR, "invalid NUMA node id outside of allowed range [0, " UINT64_FORMAT "]: %d", max_nodes, s);
                }
 
                for (i = 0; i <= max_nodes; i++)
