On 6/25/26 14:19, Jakub Wartak wrote:
> On Wed, Jun 24, 2026 at 10:29 PM Tomas Vondra <[email protected]> wrote:
> 
> Hi,
> 
>> Here's an updated patch series, with only minor changes to fix the mbind
>> issues:
> [..]
>> I've also included Jakub's "goodies" patch with the additional GUCs.
>> Those seem potentially useful to development.
> 
> Cool!
> 
>> I have some results from a new round of benchmarks, and it's a bit
>> disappointing. Or rather, there seem to be some issues that I can't
>> figure out, causing regressions.
> [..]
>> This chart is for median latency (in milliseconds):
>>
>>   clients       master     0003      0004    0003/on    0004/on
>>   -------------------------------------------------------------
>>         1        12767    12582     14509      12807      15307
>>         8        14383    14355     14149      14069      16165
>>        32        14756    15198     14836      14984      17128
>>        --------------------------------------------------------
>>         1                  103%      114%       100%       120%
>>         8                  101%       98%        98%       112%
>>        32                  102%      101%       102%       116%
>>
> 
> I haven't tried it yet, however I can spot some things:
> 
> No crystal clear idea why, but in the script I can see that you have
> io_method = io_uring and are not dropping caches, so IMHO it is too complex
> an interaction at this stage.
> 

By caches I assume you mean page cache? The test is meant to simulate a
cached system, copying data between shared buffers and page cache. My
expectation is that once we start hitting I/O, it'll completely hide
most differences due to NUMA.

> One hint: such a setup is going to be problematic for proving numbers.
> At the meeting I tried to describe that I've been using io_method = sync
> instead of 'worker' to get more predictable results (together with
> "echo 3 > drop_caches"), because then it is that backend's CPU/$NODE doing
> that pread()/pwrite() -- or any other operation performing the load --
> and it is going to put that file onto that_specific_$NODE --
> so even if you have a sequence like:
>     pgbench -i
>     pg_ctl restart
>     pgbench -c XX
> 

Hmm, I missed that point during the meeting. I wonder if "worker" is
interacting with NUMA somehow (I mean, does it load the data into the
right node?). But I'm using io_uring, and it's not clear to me why sync
would be better for benchmarking?

Ultimately, we need to make sure it works well with io_uring anyway,
right? Even if "sync" happens to be better for benchmarking (or even for
NUMA stuff), we have to make it work with worker/io_uring. Because
that's what practical systems use.
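
FWIW a comparison across io_method values could be scripted roughly
like this. It's only a sketch: bench_io_methods is a made-up helper,
and the "bench" database, the pgbench options and the use of $PGDATA
are placeholders, not taken from the actual benchmark script. The
restart is there because io_method can only be set at server start.

```shell
#!/bin/sh
# Emit the commands for one benchmark pass per io_method.
# Nothing is executed here; pipe the output to "sh -ex" to run it.
# ASSUMPTIONS: database "bench", the pgbench options and $PGDATA are
# placeholders for whatever the real benchmark uses.
bench_io_methods() {
    for method in "$@"; do
        cat <<EOF
psql -X -d postgres -c "ALTER SYSTEM SET io_method = '$method'"
pg_ctl -D "\$PGDATA" -w restart
pgbench -n -S -c 1 -T 60 bench
EOF
    done
}
```

Usage would be something like "bench_io_methods sync worker io_uring |
sh -ex", keeping everything else in the setup identical between passes.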

> then pgbench -i even with shared_buffers_numa=on will spread the Buffers
> across many nodes, yet after the restart the VFS cache portion of the data
> will still reside on the single specific $NODE that wrote it to the filesystem
> (due to local first-touch affinity, even for VFS cache), so any further
> reading:
> VFS cache --pread()--> s_b will take the hit of the remote interconnect with
> some probability, depending on where the new backends are running. Also,
> with worker it is even worse, as we have those memory queues in between. I
> think we can even have this:
> 
> file in VFS cache @ node0   --because of first touch policy (pgbench -i/prewarm)
> io worker @ node1           --hits latency from node0 and node2
> shm io worker queue @ node2 --well
> client backend @ node 3     --puts into shm io worker on node2
> 
> Therefore I'm sticking to 'sync' to ease the pain... but with uring, I suspect
> the situation is kind of similar, as we call io_uring_submit(), and we
> may end up using io-wq kernel threads, and we have those submission/completion
> (memory) queues that are located somewhere (that is on some node) too.
> 
> I think we simply lack IO/NUMA affinity for all io modes except sync, but
> I suspect it's too early and way outside the scope of this $thread. I've
> started thinking about it just last week, so... (but hopefully next week I'll
> be able to ship a helper, fscachenuma.c, to show the layout of a file across
> the VFS caches on nodes)
> 

Ah, you're suggesting the page cache stuff will be placed on a single
NUMA node? That may be true, it's a good point. And maybe it could skew
the results in a bad way. Still, that would be the case even without the
NUMA partitioning, no?
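
One way to check whether page cache placement skews the numbers might
be to repopulate the page cache under an explicit policy before each
run (e.g. interleaved vs. bound to a single node) and compare. A rough
sketch, assuming numactl is installed and that page cache allocations
follow the memory policy of the task doing the reads (as far as I know
they do, but it's worth verifying). reload_page_cache_cmds is a made-up
helper name, and it only prints the commands.

```shell
#!/bin/sh
# Emit commands that drop the page cache and then re-read every file in
# a directory under a given numactl policy, so the page cache is
# repopulated with a known placement.
# ASSUMPTION: page cache pages are allocated according to the memory
# policy of the reading task; verify this on the target kernel.
#   $1 = numactl policy option, e.g. --interleave=all or --membind=0
#   $2 = directory to read (e.g. the data directory)
reload_page_cache_cmds() {
    policy=$1
    dir=$2
    cat <<EOF
sync
echo 3 > /proc/sys/vm/drop_caches
numactl $policy find "$dir" -type f -exec cat {} + > /dev/null
EOF
}
```

The output would be piped to "sh -ex" as root, with the server stopped,
e.g. once with --membind=0 and once with --interleave=all, running the
same benchmark after each.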

> Maybe some other suggestions:
> 
> Q1) Maybe some crosschecks first?
>        # balance should be equal between nodes even for baseline
>        # linux kernel has tendency to fit shm into one if it fits
>        find /sys/devices/system/node*/ -name 'free_hugepages' -exec grep -H . {} \;
> 
>        # check N0 and N1 even for default policy, might also reveal imbalance
>        # lots of RAM and too big huge_pages allows fitting whole shm into just N0
>        # see point 4 from [1]
>        grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps
> 
>        # then during pgbench -c run maybe those:
>        mpstat -N ALL 1
>        perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
>           --per-socket -I 1000  # or -M memory_bandwidth_read,memory_bandwidth_write
> 
>     (it might reveal that problem I've described above about io_method:
>     even with pgbench -c 1 you might be reading from all sockets/wrong sockets
>     instead of the correct one with affinity)
> 

I'll try, but if you could try running some experiments on your own,
that might be helpful.
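
For reference, the free_hugepages and numa_maps checks could be
condensed into per-node totals with something like the following. Just
a sketch: both function names are made up, and the sysfs root is a
parameter only so it can be pointed at a different tree. As far as I
know the N<node>= counts for huge page mappings are in units of huge
pages, not 4kB pages.

```shell
#!/bin/sh
# Print "<node> <hugepage size> <free count>" for every node found under
# a sysfs-like tree (default: /sys/devices/system/node).
free_hugepages_per_node() {
    root="${1:-/sys/devices/system/node}"
    for f in "$root"/node*/hugepages/hugepages-*/free_hugepages; do
        [ -r "$f" ] || continue
        node=${f#"$root"/}; node=${node%%/*}
        size=${f%/free_hugepages}; size=${size##*/hugepages-}
        printf '%s %s %s\n' "$node" "$size" "$(cat "$f")"
    done
}

# Sum the N<node>=<pages> fields of all numa_maps lines containing a
# given substring, printing one "N<node> <total>" line per node.
#   $1 = numa_maps file, e.g. /proc/<backend pid>/numa_maps
#   $2 = substring selecting the mappings, e.g. anon_h
numa_pages_per_node() {
    awk -v pat="$2" '
        index($0, pat) {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    pages[kv[1]] += kv[2]
                }
        }
        END { for (n in pages) print n, pages[n] }
    ' "$1" | sort -k1.2,1n
}
```

E.g. "numa_pages_per_node /proc/$SOMEREALBACKENDPID/numa_maps anon_h"
before and after a run would show whether shared memory ended up
balanced across nodes.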

>     I like to pin CPUs to just one node for pgbench -c <NUMBER_OF_CPUS/NUMBER_OF_NODES>
>     [to saturate one node only] and start the server also with CPU pinning
>     [or use this debug_numa_node to force] to that one node and cross-check
>     what's being read (using perf), and usually I have to disarm clock balancing
>     and override weights using pg_buffercache_set_partition() to also force
>     weight to stay local only - only then I'm able to outrun master. That's
>     how this idea was born that if we are only working on node $N with some
>     relations then let's use only node $N's Buffers. But I have 90us:~280us
>     local vs remote latency, so it's probably way easier for me to see results
>     even without disabling CPU-idle-states/turboboost/etc.
> 
> Q2) Dunno, but 0007 is not changing anything in runtime and you get a huge
>     discrepancy in results when going 0006 -> 0007 for clients=1 (see
>     128% -> 112%). Could literally the same code, just a different rebuild
>     (ELF image), have a layout different enough to cause perf issues?
> 
> Hopefully next week I'll try to repro those numbers to see if I can
> help more.
> 

Thank you! That'd be great.


regards

-- 
Tomas Vondra


