Hi Tomas!

[..]

> Which I think is mostly the same thing you're saying, and you have the maps
> to support it.
Right, the thread is getting long; you were right back then, but at
least now we've got a solid explanation backed by data.
> Here's an updated version of the patch series.
Just to double-confirm: I've used those patches (v20251121*) and they
do indeed interleave the relevant parts of shared memory.
> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
> and a incorrect array length.
You'll need to rebase again: pg_buffercache_numa got updated again on
Monday and clashes with 0006.
> The main change is in 0006 - it sets the default allocation policy for
> shmem to interleaving, before doing the explicit partitioning for shared
> buffers. It does it by calling numa_set_membind before the mmap(), and
> then numa_interleave_memory() on the allocated shmem. It does this to
> allow using MAP_POPULATE - but that's commented out by default.
>
> This does seem to solve the SIGBUS failures for me. I still think there
> might be a small chance of hitting that, because of locating an extra
> "boundary" page on one of the nodes. But it should be solvable by
> reserving a couple more pages.
I can confirm: I never got any SIGBUS during the benchmarks described
below, so it's much better now.
> Jakub, what do you think?
On one hand, not using MAP_POPULATE gives instant startup; on the
other hand, using it gives much better latency predictability,
especially fresh after startup (this might matter to folks who like to
benchmark -- us? -- but initially I just used it as a simple hack to
touch memory). I would be wary of using MAP_POPULATE with s_b sized in
hundreds of GBs: startup could take minutes, which would be terrible
if someone hit a SIGSEGV in production and expected
restart_after_crash=true to save them. I mean, a WAL redo crash would
be terrible on its own, but that would be terrible * 2. Also, pretty
long-term, with DIO we'll (hopefully) get much bigger s_b anyway, so
it would hurt even more, so I think that would be a bad path(?)
I've benchmarked the thing in two scenarios (a read-only pgbench with
dataset < s_b size, across code variations and connection counts, and
a second one with concurrent sequential scans) in solid, stable
conditions: 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB shared_buffers,
dbsize ~29GB, kernel 6.14.x, no idle CPU states, no turbo boost, and
so on (literally a great home heater when it's -3C outside!)
The data is normalized so that master is the 100% baseline, separately
for HP (huge pages) on and off; each cell shows the % relative to
master with the same HP setting:
scenario I: pgbench -S

                          connections
branch     HP        1        8       64      128     1024
master     off  100.00%  100.00%  100.00%  100.00%  100.00%
master     on   100.00%  100.00%  100.00%  100.00%  100.00%
numa16     off   99.13%  100.46%   99.66%   99.44%   89.60%
numa16     on   101.80%  100.89%   99.36%   99.89%   93.43%
numa4      off   96.82%  100.61%   99.37%   99.92%   94.41%
numa4      on   101.83%  100.61%   99.35%   99.69%  101.48%
pgproc16   off   99.13%  100.84%   99.38%   99.85%   91.15%
pgproc16   on   101.72%  101.40%   99.72%  100.14%   95.20%
pgproc4    off   98.63%  101.44%  100.05%  100.14%   90.97%
pgproc4    on   101.05%  101.46%   99.92%  100.31%   97.60%
sweep16    off   99.53%  101.14%  100.71%  100.75%  101.52%
sweep16    on    97.63%  102.49%  100.42%  100.75%  105.56%
sweep4     off   99.43%  101.59%  100.06%  100.45%  104.63%
sweep4     on    97.69%  101.59%  100.70%  100.69%  104.70%
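(To make the normalization concrete: numa16/off at c=1024 is 156531 in
the raw numbers at the end vs 174692 for master/off, i.e.
156531/174692 = 89.60%, the value shown above.)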
I would consider everything within +/- 3% to be noise (technically
each branch was a different compilation/ELF binary, as changing this
#define to get 4 vs 16 required rebuilding; please see the attached
script). What I'm missing is an explanation for why, without HP, it
deteriorates so much at c=1024 with the patches.
scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions created
with pgbench --partitions=64 -i -s 2000 [~29GB], hammered modulo the
client id, without PQ, by:
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;
                          connections
branch     HP        1        8       64      128
master     off  100.00%  100.00%  100.00%  100.00%
master     on   100.00%  100.00%  100.00%  100.00%
numa16     off  115.62%  108.87%  101.08%  111.56%
numa16     on   107.68%  104.90%  102.98%  105.51%
numa4      off  113.55%  111.41%  101.45%  113.10%
numa4      on   107.90%  106.60%  103.68%  106.98%
pgproc16   off  111.70%  108.27%   98.69%  109.36%
pgproc16   on   106.98%  100.69%  101.98%  103.42%
pgproc4    off  112.41%  106.15%  100.03%  112.03%
pgproc4    on   106.73%  105.77%  103.74%  101.13%
sweep16    off  100.63%  100.38%   98.41%  103.46%
sweep16    on   109.03%   99.15%  101.17%   99.19%
sweep4     off  102.04%  101.16%  101.71%   91.86%
sweep4     on   108.33%  101.69%   97.14%  100.92%
The benefit varies, roughly +3-10% depending on the connection count.
Quite frankly I was expecting a little more, especially after
re-reading [1]. Maybe you preloaded it there using pg_prewarm? (Here I
warmed it randomly using pgbench.) It's probably something with my
test; I'll take another look, hopefully soon. The good thing is that
it never crashed, and I haven't seen any errors like "Bad address"
(probably related to AIO) as you did in [1], perhaps because I wasn't
using io_uring.
0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
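My guess is that the EINVAL is simply mbind() being handed a start
address that isn't a multiple of the (huge) page size. A rough sketch
of the kind of rounding I mean -- hypothetical helper, not code from
0007:

    #include <numa.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical helper, not from 0007: round the region out to page
     * boundaries before handing it to the kernel; mbind() rejects a start
     * address that isn't page-aligned with EINVAL, and rounding the length
     * up keeps whole pages covered.
     */
    static void
    bind_region_to_node(void *addr, size_t len, int node, size_t pagesz)
    {
        uintptr_t   start = (uintptr_t) addr;
        uintptr_t   first = start & ~(pagesz - 1);  /* round start down */
        size_t      span = ((start + len + pagesz - 1) & ~(pagesz - 1)) - first;

        numa_tonode_memory((void *) first, span, node);
    }

Rounding outward like this can of course grab part of a neighbouring
page, so the real fix probably wants the per-node regions padded or
aligned at allocation time instead.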
-J.
[1] -
https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5%40vondra.me
Raw numbers behind the scenario II table:

branch     HP         1          8          64         128
master     off  1.257394  10.096653   83.681932   63.323369
master     on   1.355974  10.801311   84.667119   70.424227
numa16     off  1.453782  10.991734   84.581887   70.642176
numa16     on   1.460134  11.330359   87.187862   74.301325
numa4      off  1.427726  11.248773   84.894327   71.619885
numa4      on   1.463109  11.514226   87.784820   75.336443
pgproc16   off  1.404562  10.931756   82.583295   69.251663
pgproc16   on   1.450629  10.876293   86.340406   72.836041
pgproc4    off  1.413467  10.717336   83.706755   70.943775
pgproc4    on   1.447225  11.424055   87.837412   71.220691
sweep16    off  1.265357  10.134920   82.351384   65.516788
sweep16    on   1.478456  10.709992   85.657325   69.852675
sweep4     off  1.283089  10.214124   85.115087   58.171417
sweep4     on   1.468863  10.984077   82.245140   71.074758
Attachment: bench_numa.sh (application/shellscript)
Raw numbers behind the scenario I table:

branch     HP       1      8      64     128    1024
master     off   6006  47039  321482  306671  174692
master     on    3718  51537  352603  341712  217981
numa16     off   5954  47256  320373  304947  156531
numa16     on    3785  51998  350338  341334  203669
numa4      off   5815  47325  319466  306432  164933
numa4      on    3786  51851  350299  340640  221202
pgproc16   off   5954  47434  319485  306219  159239
pgproc16   on    3782  52259  351629  342183  207525
pgproc4    off   5924  47717  321649  307103  158922
pgproc4    on    3757  52292  352308  342786  212753
sweep16    off   5978  47575  323757  308967  177348
sweep16    on    3630  52821  354071  344263  230097
sweep4     off   5972  47785  321674  308046  182788
sweep4     on    3632  52357  355082  344065  228218
