Hi Tomas!

[..]

> Which I think is mostly the same thing you're saying, and you have the maps
> to support it.
Right, the thread is getting long; you were right back then, but at
least now we've got a solid explanation backed by data.
> Here's an updated version of the patch series.
Just to double-confirm: I've used those patches (v20251121*) and they
do indeed interleave the relevant parts of shared memory.
> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
> and a incorrect array length.
You'll need to rebase again: pg_buffercache_numa got updated again on
Monday and clashes with 0006.
> The main change is in 0006 - it sets the default allocation policy for
> shmem to interleaving, before doing the explicit partitioning for shared
> buffers. It does it by calling numa_set_membind before the mmap(), and
> then numa_interleave_memory() on the allocated shmem. It does this to
> allow using MAP_POPULATE - but that's commented out by default.
>
> This does seem to solve the SIGBUS failures for me. I still think there
> might be a small chance of hitting that, because of locating an extra
> "boundary" page on one of the nodes. But it should be solvable by
> reserving a couple more pages.
I can confirm: I never got any SIGBUS during the benchmarks described
below, so it's much better now.
> Jakub, what do you think?
On one hand, not using MAP_POPULATE gives instant startup; on the
other hand, using it gives much better latency predictability,
especially fresh after startup (this might matter to folks who like to
benchmark -- us? -- but initially I just used it as a simple hack to
touch memory). I would be wary of using MAP_POPULATE with s_b sized in
hundreds of GBs: startup could take minutes, which would be terrible
if someone hit a SIGSEGV in production and expected
restart_after_crash=true to save them. I mean, a WAL redo crash would
be terrible on its own, but that would be terrible * 2. Also, pretty
long-term, with DIO we'll (hopefully) get much bigger s_b anyway, so
it would hurt even more, so I think that would be a bad path(?)
I've benchmarked the thing in two scenarios (a read-only pgbench with
dataset < s_b size, across code variations and connection counts, and
a second one with concurrent sequential scans) in solid, stable
conditions: 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB shared_buffers,
dbsize ~29GB, kernel 6.14.x, no idle CPU states, no turbo boost, and
so on (literally a great home heater when it's -3C outside!)
The data is normalized so that master is the 100% baseline, separately
for HP (huge pages) on and off; each cell shows the % relative to
master with the same HP setting:
scenario I: pgbench -S

                          connections
branch     HP        1        8       64      128     1024
master     off  100.00%  100.00%  100.00%  100.00%  100.00%
master     on   100.00%  100.00%  100.00%  100.00%  100.00%
numa16     off   99.13%  100.46%   99.66%   99.44%   89.60%
numa16     on   101.80%  100.89%   99.36%   99.89%   93.43%
numa4      off   96.82%  100.61%   99.37%   99.92%   94.41%
numa4      on   101.83%  100.61%   99.35%   99.69%  101.48%
pgproc16   off   99.13%  100.84%   99.38%   99.85%   91.15%
pgproc16   on   101.72%  101.40%   99.72%  100.14%   95.20%
pgproc4    off   98.63%  101.44%  100.05%  100.14%   90.97%
pgproc4    on   101.05%  101.46%   99.92%  100.31%   97.60%
sweep16    off   99.53%  101.14%  100.71%  100.75%  101.52%
sweep16    on    97.63%  102.49%  100.42%  100.75%  105.56%
sweep4     off   99.43%  101.59%  100.06%  100.45%  104.63%
sweep4     on    97.69%  101.59%  100.70%  100.69%  104.70%
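(To make the normalization concrete: numa16/off at c=1024 is 156531 in
the raw numbers at the end vs 174692 for master/off, i.e.
156531/174692 = 89.60%, the value shown above.)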
I would consider everything within +/- 3% to be noise (technically
each branch was a different compilation/ELF binary, as changing this
#define to get 4 vs 16 required rebuilding; please see the attached
script). What I'm missing is an explanation for why, without HP, it
deteriorates so much at c=1024 with the patches.
scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions created
with pgbench --partitions=64 -i -s 2000 [~29GB], hammered modulo the
client id, without PQ, by:
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;
                          connections
branch     HP        1        8       64      128
master     off  100.00%  100.00%  100.00%  100.00%
master     on   100.00%  100.00%  100.00%  100.00%
numa16     off  115.62%  108.87%  101.08%  111.56%
numa16     on   107.68%  104.90%  102.98%  105.51%
numa4      off  113.55%  111.41%  101.45%  113.10%
numa4      on   107.90%  106.60%  103.68%  106.98%
pgproc16   off  111.70%  108.27%   98.69%  109.36%
pgproc16   on   106.98%  100.69%  101.98%  103.42%
pgproc4    off  112.41%  106.15%  100.03%  112.03%
pgproc4    on   106.73%  105.77%  103.74%  101.13%
sweep16    off  100.63%  100.38%   98.41%  103.46%
sweep16    on   109.03%   99.15%  101.17%   99.19%
sweep4     off  102.04%  101.16%  101.71%   91.86%
sweep4     on   108.33%  101.69%   97.14%  100.92%
The benefit varies, roughly +3-10% depending on the connection count.
Quite frankly I was expecting a little more, especially after
re-reading [1]. Maybe you preloaded it there using pg_prewarm? (Here I
warmed it randomly using pgbench.) It's probably something with my
test; I'll take another look, hopefully soon. The good thing is that
it never crashed, and I haven't seen any errors like "Bad address"
(probably related to AIO) as you did in [1], perhaps because I wasn't
using io_uring.
0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
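My guess is that the EINVAL is simply mbind() being handed a start
address that isn't a multiple of the (huge) page size. A rough sketch
of the kind of rounding I mean -- hypothetical helper, not code from
0007:

    #include <numa.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical helper, not from 0007: round the region out to page
     * boundaries before handing it to the kernel; mbind() rejects a start
     * address that isn't page-aligned with EINVAL, and rounding the length
     * up keeps whole pages covered.
     */
    static void
    bind_region_to_node(void *addr, size_t len, int node, size_t pagesz)
    {
        uintptr_t   start = (uintptr_t) addr;
        uintptr_t   first = start & ~(pagesz - 1);  /* round start down */
        size_t      span = ((start + len + pagesz - 1) & ~(pagesz - 1)) - first;

        numa_tonode_memory((void *) first, span, node);
    }

Rounding outward like this can of course grab part of a neighbouring
page, so the real fix probably wants the per-node regions padded or
aligned at allocation time instead.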
-J.
[1] -
https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5%40vondra.me
Raw numbers behind the scenario II table:

branch     HP         1          8          64         128
master     off  1.257394  10.096653   83.681932   63.323369
master     on   1.355974  10.801311   84.667119   70.424227
numa16     off  1.453782  10.991734   84.581887   70.642176
numa16     on   1.460134  11.330359   87.187862   74.301325
numa4      off  1.427726  11.248773   84.894327   71.619885
numa4      on   1.463109  11.514226   87.784820   75.336443
pgproc16   off  1.404562  10.931756   82.583295   69.251663
pgproc16   on   1.450629  10.876293   86.340406   72.836041
pgproc4    off  1.413467  10.717336   83.706755   70.943775
pgproc4    on   1.447225  11.424055   87.837412   71.220691
sweep16    off  1.265357  10.134920   82.351384   65.516788
sweep16    on   1.478456  10.709992   85.657325   69.852675
sweep4     off  1.283089  10.214124   85.115087   58.171417
sweep4     on   1.468863  10.984077   82.245140   71.074758
Attachment: bench_numa.sh (application/shellscript)
Raw numbers behind the scenario I table:

branch     HP       1      8      64     128    1024
master     off   6006  47039  321482  306671  174692
master     on    3718  51537  352603  341712  217981
numa16     off   5954  47256  320373  304947  156531
numa16     on    3785  51998  350338  341334  203669
numa4      off   5815  47325  319466  306432  164933
numa4      on    3786  51851  350299  340640  221202
pgproc16   off   5954  47434  319485  306219  159239
pgproc16   on    3782  52259  351629  342183  207525
pgproc4    off   5924  47717  321649  307103  158922
pgproc4    on    3757  52292  352308  342786  212753
sweep16    off   5978  47575  323757  308967  177348
sweep16    on    3630  52821  354071  344263  230097
sweep4     off   5972  47785  321674  308046  182788
sweep4     on    3632  52357  355082  344065  228218
