On Thu, Jan 15, 2026 at 12:26 AM Tomas Vondra <[email protected]> wrote:
>
[..]
>
> Anyway, my plan at this point is to revive the old patch (before
> changing direction to the simple patch), and see if we can observe a
> difference on the "right" hardware. Maybe some of the results with no
> improvements were due to this. This workload seems much more realistic.
>

I think I have an answer as to why your patch is misbehaving, but I might be wrong.
First, my 4s16c64t HW numbers from master back from ~21st Nov 2025 (so here
without your patch), s_b=8GB, huge_pages, pgbench -i scale 150 (so it flies
under the NBuffers/4 radar), which gave me a ~1922MB pgbench_accounts, and then
I run the select sum(abalance):

--membind=0 --cpunodebind=0 latency average = 2468.321 ms, stddev = 0.479 ms
S0 @ 825MB/s (uncore_imc/cas_count_read/)

--membind=0 --cpunodebind=1 latency average = 2780.653 ms, stddev = 2.080 ms
S0 @ 729MB/s (uncore_imc/cas_count_read/)

(2 socket hops, as the old UPI/QPI had at most 2 interlinks)
--membind=0 --cpunodebind=2 latency average = 2811.500 ms, stddev = 1.958 ms
--membind=0 --cpunodebind=3 latency average = 2777.305 ms, stddev = 1.314 ms

So under ideal conditions I should be getting a 13-14% boost if all goes well.
However, with the patchset (v20251121, debug_numa='buffers,procs') I get
somewhat worse numbers than the master results above.

--membind=0 --cpunodebind=0 (we know the patchset code will
"interleave" anyway)
latency average = 2885.806 ms, stddev = 20.349 ms
and we are reading RAM from all four sockets, each @ 170-180 MB/s (~720MB/s total)

The patch spread the 1922MB relation (which would fit into one node) across all four nodes:

postgres=# select numa_node, count(*) from pg_buffercache_numa where bufferid in
  (select bufferid from pg_buffercache  where relfilenode =
  (select relfilenode from pg_class where relname = 'pgbench_accounts'))
group by 1 order by 2;
 numa_node | count
-----------+-------
         2 | 55404
         3 | 55405
         1 | 55415
         0 | 79678

Also pg_buffercache_partitions.weights indicates a "{24,24,24,24}" (%) split.

Now if I use the migratepages trick (`migratepages <pid> 1-3 0`), it really
disarms the patchset's semi-manual interleave (numa_maps still shows bind:1-3,
but the pages now sit on real NUMA node 0), and it *does* restore performance
from ~2900ms to ~2600ms, so this really is just non-optimal memory placement
(by the clocksweep balancing).
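
For completeness, the same thing can be done programmatically with libnuma's
numa_migrate_pages(); below is a minimal stand-alone sketch (my own toy code,
not part of any patch; build with -lnuma, minimal error handling), doing the
same thing as the command above, just without shelling out:

/*
 * Toy C equivalent of `migratepages <pid> 1-3 0`: move the pages of <pid>
 * that currently live on nodes 1-3 over to node 0, via libnuma.
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int
main(int argc, char **argv)
{
	struct bitmask *from;
	struct bitmask *to;
	int			pid;

	if (argc != 2 || numa_available() < 0)
	{
		fprintf(stderr, "usage: %s <postmaster-pid> (needs libnuma)\n", argv[0]);
		return 1;
	}

	pid = atoi(argv[1]);
	from = numa_parse_nodestring("1-3");	/* nodes to drain */
	to = numa_parse_nodestring("0");		/* target node */

	if (numa_migrate_pages(pid, from, to) < 0)
		perror("numa_migrate_pages");

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return 0;
}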

So the question is why such a table (1922MB, 245902 relpages) ends up
spreading to the other sockets so quickly. On the assumption that hitting 4
sockets instead of 1 is what makes it slow, I disarmed the patchset's
partitioned clocksweep balancing like below:
@@ -497,7 +497,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, [...]
-       sweep = ChooseClockSweep(true);
+       sweep = ChooseClockSweep(false);

and on subsequent tries this gets me a perfect weight there (thanks to no balancing):
postgres=# select partition, numa_node, total_allocs, num_allocs, weights
from pg_buffercache_partitions;
 partition | numa_node | total_allocs | num_allocs |   weights
-----------+-----------+--------------+------------+-------------
         0 |         0 |       246186 |          0 | {100,0,0,0}
         1 |         1 |            0 |          0 | {0,100,0,0}
         2 |         2 |            0 |          0 | {0,0,100,0}
         3 |         3 |            0 |          0 | {0,0,0,100}

which yields (compare to the initial measurements of 2468 / 2780 / 2811 ms):
latency average = 2737.440 ms
latency stddev = 5.715 ms
socket 0 reliably outputs a constant 725MB/s (the others are idle, as expected)

I think we may simply need to (re?)think the strategy of how/when to
distribute, because even if I load just 112 MB of fresh data after startup,
the weights change from {100,0,0,0}...

postgres=# create table tmp1 as select id, repeat('A', 1024) t
from generate_series(1, 100000) as id;
SELECT 100000
-- 112 MB
postgres=# \dt+ tmp1
[..]

and then scan it, I already get just 62% local affinity (again, my whole s_b
is 8GB and the per-NUMA-node share is about 2GB, so we are under the
NBuffers/4 radar here too):

postgres=# select partition, numa_node, total_allocs, num_allocs, weights
from pg_buffercache_partitions;
 partition | numa_node | total_allocs | num_allocs |    weights
-----------+-----------+--------------+------------+---------------
         0 |         0 |         2587 |          1 | {62,12,12,12}
         1 |         1 |            0 |          0 | {0,100,0,0}
         2 |         2 |            0 |          0 | {0,0,100,0}
         3 |         3 |            0 |          0 | {0,0,0,100}

-- double-confirmation:
postgres=# select numa_node, count(*), count(*) * 100.0 /
sum(count(*)) OVER() AS pct
from pg_buffercache_numa where bufferid in (
        select bufferid from pg_buffercache where relfilenode =
            (select relfilenode from pg_class where relname = 'tmp1')
) group by 1 order by 2;
 numa_node | count |         pct
-----------+-------+---------------------
         3 |  1680 | 11.7130307467057101
         2 |  1683 | 11.7339468730391132
         1 |  1690 | 11.7827511678170536
         0 |  9290 | 64.7702712124381231

so of course when scanning it in a loop we are stressing pretty much all the
sockets here. It gets even worse from there: if I use multiple tmp* tables
like that from a single backend (total size << 1GB), I end up with a
"{24,24,24,24}" split (even though all of them would fit on my node).

Of course, all of the above was written with the assumption of chasing the
best latency: a single backend, and no backends coming from different nodes.

But now, if I were to hypothetically benchmark pgbench -S on master (from
Nov) against your patchset from back then, with a low number of backends, I
would be comparing a single-node hugepage allocation (on a random node,
because it fits) vs the patchset interleaving memory. And if the kernel
migrated all those backends (in the master case) toward the node where most
of s_b is located, your patchset simply couldn't win this.

This brings me to a point where I'm suspicious of this clock-sweep balancing
idea (the partitioning is fine, it's just the balancing which seems to kick in
prematurely). BTW: much earlier you seem to have benchmarked

My thoughts for today on how it should work, if you want to "demonstrate any
benefits on other workloads", are as follows:

1. We should not pin backends to specific CPUs/NUMA nodes, as the kernel
should be free to move processes closer to the data they request most (it
knows the state of memory transfers and CPU util% across nodes better than
we do).
- we query sched_getcpu() and derive the node from it (see the sketch below)
- we stick to the PGPROC and buffers from that node, and that's all
(we seem to be doing just that in the patchset, great!)
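
For reference, here is a minimal stand-alone sketch of what I mean (my own
toy code, not the patchset's; glibc sched_getcpu() plus libnuma's
numa_node_of_cpu(), build with -lnuma):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numa.h>

/* On which NUMA node is this process currently running? */
static int
my_numa_node(void)
{
	int			cpu = sched_getcpu();

	if (cpu < 0 || numa_available() < 0)
		return -1;				/* unknown: fall back to "any node" */

	return numa_node_of_cpu(cpu);
}

int
main(void)
{
	printf("currently running on NUMA node %d\n", my_numa_node());
	return 0;
}

(and since the kernel is free to move us afterwards, the result is only a
hint, which is exactly why I would not pin)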

2. We should then try to stick to the local node as much as possible until
it's almost full, and maybe only then start using remote memory as a last
resort. Or maybe even try to avoid it at all costs.
- wouldn't eviction (as a first baby step) be preferable to using remote
  memory?
- we get the local-affinity boost, so no distribution to the other partitions
  (unless absolutely necessary?)
- ring buffers should somehow protect against one backend filling all the RAM
  on the node (well, except pg_prewarm? maybe we should adjust it so it
  intentionally interleaves from the start?)
  -- related: the NBuffers / 4 magic number drives me nuts, but maybe we
     should tweak it to take the number of nodes into account too
     (NBuffers / nodes / 4?) to avoid filling a whole node (here, with
     s_b=8GB, NBuffers/4 is ~2GB, i.e. an entire node's share, while
     NBuffers / nodes / 4 would be ~512MB)
- we would get more I/O, as there would potentially be a lower chance of
  finding the data in memory on that node(?), so we would need to counter
  this somehow
- maybe that's a little bit sci-fi, or I'm going to get flamed here for it,
  but we could potentially track a local vs remote usagecount (the buffer
  state seems to be using 54 bits, so we could add 4 more bits to track
  "remote" access, AKA usagecount_remote? though we don't seem to have space
  to track the exact origin/remote node). If we tracked local vs remote, we
  would have some input on whether interleaving makes sense or not, wouldn't
  we? (that would somehow have to be tracked per relation/"blockset"; just
  food for thought, I don't know whether we even have infra for that)
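
To make that last idea slightly more concrete, a toy stand-alone sketch --
deliberately not the real buf_internals.h layout, and names like ToyBufUsage
are made up -- just to show the shape of the idea:

/*
 * Toy sketch only -- not the actual buffer state layout.  Keep a small
 * saturating counter of pins that came from a non-local NUMA node
 * ("usagecount_remote") next to the normal usage count, so that
 * eviction/balancing could tell mostly-local buffers from mostly-remote
 * ones.
 */
#include <stdbool.h>
#include <stdint.h>

#define TOY_USAGECOUNT_MAX	15	/* fits in 4 bits, as suggested above */

typedef struct ToyBufUsage
{
	uint8_t		usage_local;	/* pins from the buffer's own node */
	uint8_t		usage_remote;	/* pins from any other node */
} ToyBufUsage;

static inline void
toy_bump_usage(ToyBufUsage *u, bool pin_is_local)
{
	uint8_t	   *c = pin_is_local ? &u->usage_local : &u->usage_remote;

	if (*c < TOY_USAGECOUNT_MAX)
		(*c)++;
}

/*
 * If remote hits dominate, interleaving (or moving the buffer to another
 * partition) might actually pay off; if local hits dominate, it won't.
 */
static inline bool
toy_mostly_remote(const ToyBufUsage *u)
{
	return u->usage_remote > u->usage_local;
}

With something like that, the balancing could consult toy_mostly_remote()
(or a per-relation aggregate of it) before deciding whether spreading a
partition's allocations actually helps.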

-J.

