On Wed, Jun 24, 2026 at 10:29 PM Tomas Vondra <[email protected]> wrote:

Hi,

> Here's an updated patch series, with only minor changes to fix the mbind
> issues:
[..]
> I've also included Jakub's "goodies" patch with the additional GUCs.
> Those seem potentially useful to development.

Cool!

> I have some results from a new round of benchmarks, and it's a bit
> disappointing. Or rather, there seem to be some issues that I can't
> figure out, causing regressions.
[..]
> This chart is for median latency (in milliseconds):
>
>   clients       master     0003      0004    0003/on    0004/on
>   -------------------------------------------------------------
>         1        12767    12582     14509      12807      15307
>         8        14383    14355     14149      14069      16165
>        32        14756    15198     14836      14984      17128
>        --------------------------------------------------------
>         1                  103%      114%       100%       120%
>         8                  101%       98%        98%       112%
>        32                  102%      101%       102%       116%
>

I haven't tried it yet, but I can spot a few things:

I have no crystal-clear idea why, but in the script I can see that you have
io_method = io_uring and are not dropping caches, so IMHO the interaction is
too complex at this stage.

One hint: such a setup is going to be problematic for proving numbers.
At the meeting I tried to describe that I've been using io_method = sync
instead of 'worker' to get more predictable results (together with
echo 3 > drop_caches), because then it is the CPU/$NODE of that backend
(or of whatever other operation performs the load) doing the
pread()/pwrite(), and that puts the file onto that specific $NODE.
So even if you have a sequence like:
    pgbench -i
    pg_ctl restart
    pgbench -c XX

then pgbench -i, even with shared_buffers_numa=on, will spread the Buffers
across many nodes, yet after the restart the VFS cache portion of the data
will still reside on the single specific $NODE that wrote it to the filesystem
(due to local first-touch affinity, which applies even to the VFS cache). So
any further read, VFS cache --pread()--> s_b, will take the hit of the remote
interconnect with some probability, depending on where the new backends are
running. With 'worker' it is even worse, as we have those memory queues in
between. I think we can even have this:

file in VFS cache @ node0   --because of first touch policy (pgbench -i/prewarm)
io worker @ node1           --hits latency from node0 and node2
shm io worker queue @ node2 --well
client backend @ node3      --puts into shm io worker queue on node2

Therefore I'm sticking to 'sync' to ease the pain... but with io_uring, I
suspect the situation is kind of similar, as we call io_uring_submit() and we
may end up using io-wq kernel threads, and we have those submission/completion
(memory) queues that are located somewhere (that is, on some node) too.
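
The 'sync' plus drop_caches setup described above could be scripted roughly
as below. This is only a sketch: PGDATA, SCALE, CLIENTS, DURATION and the
DRY_RUN switch are hypothetical names, not taken from the benchmark script in
this thread, and placing the cache drop between stop and start is an
assumption about where it helps most.

```shell
#!/bin/sh
# Hypothetical knobs (assumptions, not from the original benchmark script).
: "${PGDATA:=/path/to/pgdata}"
: "${SCALE:=100}"
: "${CLIENTS:=1}"
: "${DURATION:=60}"

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One benchmark round: load the data, stop the server, empty the page cache
# so no file pages stay on the node that wrote them, then restart with
# io_method=sync so each backend does its own pread()/pwrite().
bench_round() {
    run pgbench -i -s "$SCALE"
    run pg_ctl -D "$PGDATA" stop -m fast
    run sync
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # needs root
    run pg_ctl -D "$PGDATA" start -w -o "-c io_method=sync"
    run pgbench -c "$CLIENTS" -T "$DURATION"
}
```

Running it as DRY_RUN=1 with bench_round called at the end only prints the
commands, which is a cheap way to check the order before touching a real
cluster.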

I think we simply lack IO/NUMA affinity for all io modes except sync, but
I suspect it's too early for that and way outside the scope of this $thread.
I started thinking about it just last week, so... (but hopefully next week
I'll be able to ship a helper, fscachenuma.c, to show the layout of a file
across the VFS caches on the nodes)

Maybe some other suggestions:

Q1) Maybe some crosschecks first?
       # balance should be equal between nodes even for baseline
       # linux kernel has tendency to fit shm into one if it fits
       find /sys/devices/system/node*/ -name 'free_hugepages' -exec grep -H . {} \;

       # check N0 and N1 even for default policy, might also reveal imbalance
       # lots of RAM and too big huge_pages allows fitting whole shm into just N0
       # see point 4 from [1]
       grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps

       # then during pgbench -c run maybe those:
       mpstat -N ALL 1
       perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
          --per-socket -I 1000  # or -M memory_bandwidth_read,memory_bandwidth_write

    (this might reveal the problem I described above about io_method:
    even with pgbench -c 1 you might be reading from all sockets/the wrong
    sockets instead of the correct one with affinity)
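
For the numa_maps cross-check above, eyeballing the N0=/N1= fields across
many mappings gets tedious, so a small helper can total them per node. This
is a sketch, not an existing tool: numa_pages_per_node is a hypothetical
name, and it simply sums the N<node>=<count> fields of every line that
matches the same /anon_h pattern used in the grep.

```shell
#!/bin/sh
# numa_pages_per_node FILE [PATTERN]
# Sum the N<node>=<count> fields of every line in FILE that contains PATTERN
# (default: /anon_h) and print one "node count" pair per line, sorted by node.
numa_pages_per_node() {
    awk -v pat="${2:-/anon_h}" '
        index($0, pat) {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    # kv[1] is e.g. "N0"; drop the leading "N" to get the node id
                    pages[substr(kv[1], 2)] += kv[2]
                }
            }
        }
        END {
            for (n in pages) print n, pages[n]
        }
    ' "$1" | sort -n
}
```

Usage would be numa_pages_per_node /proc/$SOMEREALBACKENDPID/numa_maps; a
heavily skewed total on a baseline run would point at the imbalance mentioned
above.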

    I like to pin CPUs to just one node for pgbench -c <NUMBER_OF_CPUS/NUMBER_OF_NODES>
    [to saturate one node only] and also start the server with CPU pinning
    [or use this debug_numa_node to force it] to that one node, and then
    cross-check what's being read (using perf). Usually I have to disarm clock
    balancing and override weights using pg_buffercache_set_partition() to also
    force the weight to stay local only; only then am I able to outrun master.
    That's how the idea was born that if we are only working on node $N with
    some relations, then let's use only node $N's Buffers. But I have
    90us:~280us local vs remote latency, so it's probably way easier for me to
    see results even without disabling CPU-idle-states/turboboost/etc.
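
The single-node pinning described above can be derived from sysfs. The sketch
below only prints the commands rather than running them; count_cpus and
pin_commands are hypothetical helper names, and using the CPU count of the
chosen node as the client count assumes a symmetric machine where that equals
NUMBER_OF_CPUS/NUMBER_OF_NODES.

```shell
#!/bin/sh
# count_cpus CPULIST: number of CPUs in a sysfs-style list such as "0-3,8-11".
count_cpus() {
    echo "$1" | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if (split($i, r, "-") == 2) n += r[2] - r[1] + 1   # range a-b
            else if ($i != "") n += 1                           # single CPU
        }
        print n
    }'
}

# pin_commands NODE [SYSFS_ROOT]: print numactl invocations that keep both the
# server and pgbench on the CPUs of NODE, with one client per CPU of that node.
pin_commands() {
    node=$1
    root=${2:-/sys/devices/system/node}
    clients=$(count_cpus "$(cat "$root/node$node/cpulist")")
    echo "numactl --cpunodebind=$node pg_ctl start"
    echo "numactl --cpunodebind=$node pgbench -c $clients"
}
```

For example, pin_commands 0 prints the two commands for node 0. Only CPU
binding is used here, on the assumption that memory placement should be left
to the patch (or to debug_numa_node) rather than forced with --membind.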

Q2) Dunno, but 0007 is not changing anything at runtime and you get a huge
    discrepancy in results when going 0006 -> 0007 for clients=1 (see
    128% -> 112%). Could literally the same code, just a different rebuild
    (ELF image), have a layout different enough to cause perf issues?

Hopefully next week I'll try to repro those numbers to see if I can
help more.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzo9xnJSgO4b26DTZqPuObcQ-6ncay%2BmOEKs9rzCkegUA%40mail.gmail.com

