On Mon, Jun 29, 2026 at 9:42 AM Jakub Wartak <[email protected]> wrote:
>
> On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <[email protected]> wrote:
> >
> > >> I have some results from a new round of benchmarks, and it's a bit
> > >> disappointing. Or rather, there seem to be some issues that I can't
> > >> figure out, causing regressions.
> > > [..]
> > >> This chart is for median latency (in milliseconds):
> > >>
> > >>   clients   master     0003     0004   0003/on   0004/on
> > >>   -------------------------------------------------------------
> > >>         1    12767    12582    14509     12807     15307
> > >>         8    14383    14355    14149     14069     16165
> > >>        32    14756    15198    14836     14984     17128
> > >>   --------------------------------------------------------
> > >>         1              103%     114%      100%      120%
> > >>         8              101%      98%       98%      112%
> > >>        32              102%     101%      102%      116%
> > >>
[..lots of variables..]
> > I'll try, but if you could try running some experiments on your own,
> > that might be helpful.
> [..]
> > > Hopefully next week I'll try to repro those numbers to see if I can
> > > help more.
> > >
> >
> > Thank you! That'd be great.
>
> Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
> that fscachenuma proggie to aid us in troubleshooting.
>
> -J.
>
> [0] - https://github.com/jakubwartakEDB/fscachenuma

Hi Tomas,

OK, so I've run a couple of tests, modified run.sh and also tried to fix
some inefficiencies spotted while testing this. Note that the attached
performance matrix is in TPS (so more is better). Raw results/CSV and
scripts are attached too.

* run2 = 2 workloads, partitioned pgbench_accounts
* run3 = just pgbenchS w/o partitioning + warmup
* run4 = semi-like pgbenchS w/o partitioning but 100k rows + warmup

One important modification in those run shell scripts is that they clean
the page cache (drop_caches), as mentioned earlier, to avoid false results
where everything would land on node#N after pgbench -i ran. That is
probably why I did not get any of the regressions you've got. Or better,
diff -u the run*.sh scripts.

The "inst-optimized" is just the same patchset (so "inst-patchset") + a
crude attempt in 0008 to smooth things out further and avoid regressions
while I've been working on this. 0008 does a couple of things:

a. implements CPU/node caching instead of querying it for every single
   buffer. Even if on x86_64 that is optimized by vdso/kernel to avoid the
   real syscall, the semi-syscall tax seems to be visible when fetching
   lots of buffers. 128 is arbitrary and still kind of low (128*8kB=1MB,
   and we are doing hundreds of MB/s, while rescheduling happened only
   every couple of seconds).
b1. minimizes the attempts to use other partitions until some threshold
   is exceeded (and then it relies on the scan-all-partitions)
b2. avoids selecting idle partitions (defined as allocations below
   avg_allocs/2) - if there are low allocations there, it is debatable
   whether better cache utilization or sticking to lower latency is
   better (e.g. in some workloads buffer reuse is close to 0, so lower
   latency is clearly better)

Results are attached, some observations:

0. There were vast differences depending on how pg_ctl is started
   (interleaved or not), so I've decided in the end to show results
   relative to both situations.

1. In run2/seqconcurrscans I've saturated my interconnect and that's why
   it's giving 129-155% there. I don't have access to physical hw, but I
   suspect that a modern 2-socket EPYC5 has like ~614GB/s per socket RAM
   bandwidth, but the max one-way bandwidth of the interconnect is around
   ~220GB/s (no way for me to prove it), so *IF* with hundreds of cores we
   would be able to fetch at this rate, we could saturate modern hardware
   that way too (and we briefly touched a related topic: batched executor,
   accelerating it so fast that those effects could be more easily
   achievable)

2. run3 has no partitioning because according to perf and my eyes, it
   spent time not on the buffers itself (thus it was way heavier on CPU
   [partitioning] than on memory...), so that's how run3 was born without
   partitions :D

3. The warmup is critical for run3/pgbenchS, as I've noticed that
   depending on ${luck}, if you start the "master" (baseline w/o
   interleaving) and pgbench it right away, everything might land on
   node0 (s_b, pagecache), so "master" was basically cheating in
   benchmarks, especially vs your patchset where it was spreading way too
   soon. Having drop_caches, an additional warmup and only then the proper
   pgbench kind of reduces that luck factor. In general I think all runs
   with c=1 seem to have a kind of low signal-to-noise ratio. I was
   thinking about pinning to always stick to the same NUMA node from the
   start to win against master just for these c=1 scenarios, but "meh".

3b. in short, for pgbench -S we can gain like 2-5%

4. run4 was made just to prove that a workload fetching more buffers than
   the standard pgbench -S (1 row?) seems to be the key to proving the
   optimizations in 0008 (other than showing good benefits for
   seqconcurrscans of course). So run4 just shows the benefit compared to
   0001-0007 alone.

Still on the table:

1. maybe even better balancing is possible (?), but this one seems
   enough? I'm out of other ideas, well other than the
   "shared-relation-use-by-foreign-node" idea described much earlier (but
   I won't be able to pull that off), so I'm not entering this rabbit
   hole any deeper.
2. Digging into io_method=worker optimizations (answering the question:
   are they necessary?). Maybe I'll throw in run5 quite soon, this is
   going to be crucial to answer.
3. Potentially the BAS strategies mentioned earlier (forcing use of just
   local partitions for known-to-be-only-local users: CTAS/VACUUM/etc),
   but I'm afraid that's not for me, as I would certainly break/violate
   some boundary that is invisible to me.

Maybe you could run those run*.sh with master vs inst-patchset/optimized?
(I'm not sure, maybe there's even a different factor at play too...)

-J.

Title: Performance Evaluation Matrix
Table 1: Performance Relative to Master Default (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master default (Baseline) | master interleave | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS | 1 | 100.00% | 99.09% | 98.80% | 100.30% | 101.56% | 101.91% | 101.77% | 101.28% |
| pgbenchS | 8 | 100.00% | 100.34% | 99.64% | 100.06% | 99.80% | 100.78% | 100.87% | 101.24% |
| pgbenchS | 32 | 100.00% | 100.16% | 99.59% | 99.40% | 99.53% | 100.20% | 100.16% | 100.06% |
| seqconcurrscans | 1 | 100.00% | 77.29% | 100.68% | 97.11% | 102.78% | 97.76% | 95.93% | 75.08% |
| seqconcurrscans | 8 | 100.00% | 71.27% | 86.54% | 112.56% | 110.55% | 99.79% | 103.61% | 99.50% |
| seqconcurrscans | 32 | 100.00% | 94.33% | 100.38% | 118.67% | 122.60% | 109.97% | 108.72% | 107.00% |
Table 2: Performance Relative to Master Interleave (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master interleave (Baseline) | master default | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS | 1 | 100.00% | 100.92% | 99.71% | 101.22% | 102.49% | 102.84% | 102.70% | 102.21% |
| pgbenchS | 8 | 100.00% | 99.66% | 99.30% | 99.72% | 99.46% | 100.43% | 100.53% | 100.90% |
| pgbenchS | 32 | 100.00% | 99.84% | 99.44% | 99.25% | 99.37% | 100.04% | 100.00% | 99.90% |
| seqconcurrscans | 1 | 100.00% | 129.39% | 130.27% | 125.65% | 132.98% | 126.49% | 124.13% | 97.14% |
| seqconcurrscans | 8 | 100.00% | 140.31% | 121.42% | 157.93% | 155.12% | 140.01% | 145.37% | 139.60% |
| seqconcurrscans | 32 | 100.00% | 106.02% | 106.42% | 125.81% | 129.97% | 116.58% | 115.26% | 113.44% |
numabenchhackersreview-2026-06-30.tgz
Description: application/compressed-tar
Table 1: Performance Relative to Master Default (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master default (Baseline) | master interleave | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS | 1 | 100.00% | 118.07% | 115.29% | 98.55% | 102.76% | 115.50% | 98.88% | 113.40% |
| pgbenchS | 8 | 100.00% | 102.12% | 102.74% | 104.22% | 105.09% | 104.36% | 104.70% | 103.59% |
| pgbenchS | 32 | 100.00% | 100.15% | 98.98% | 98.67% | 99.27% | 100.38% | 100.32% | 100.51% |
Table 2: Performance Relative to Master Interleave (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master interleave (Baseline) | master default | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS | 1 | 100.00% | 84.69% | 97.64% | 83.46% | 87.03% | 97.82% | 83.74% | 96.04% |
| pgbenchS | 8 | 100.00% | 97.93% | 100.61% | 102.06% | 102.91% | 102.20% | 102.53% | 101.45% |
| pgbenchS | 32 | 100.00% | 99.85% | 98.84% | 98.53% | 99.12% | 100.23% | 100.17% | 100.36% |
From 8513d188ed5ed999e72fc3a58046bbc1ff9f5688 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Tue, 30 Jun 2026 14:22:02 +0200
Subject: [PATCH v20260630-0008] clock-sweep: cached CPU/NUMA node and more locality-aware balancing

Enhancements on top of 0001-0007, to have slightly better NUMA locality and
performance.

1. Cache numa_node_of_cpu()/sched_getcpu() per backend in
   ClockSweepPartitionIndex(), refreshing every CLOCKSWEEP_CPU_NODE_REFRESH
   allocations rather than on every call (visible hot buffer path in perf)

2. CLOCKSWEEP_BALANCE_THRESHOLD - make it less likely to redirect on any
   surplus of allocations (so scatter buffers LESS onto remote nodes). With
   this, it redirects its allocations to other (remote?) partitions when the
   allocation exceeds the per-partition average allocation rate by this
   percentage factor.

3. Avoid redirects to "idle" partitions: a redirect partition target must
   have some traffic which is at least 2x our demand. This eliminates cold
   partitions, but we can still reach them using scan-all-partitions
   fallback.
---
 src/backend/storage/buffer/freelist.c | 85 +++++++++++++++++++++++----
 1 file changed, 74 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e677c71e0b3..d64c2c67eb6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -55,6 +55,9 @@
  */
 #define CLOCKSWEEP_HISTORY_COEFF	0.5
 
+/* How often backend should re-fetch the CPU/node on which it is running on? */
+#define CLOCKSWEEP_CPU_NODE_REFRESH 128
+
 /*
  * GUCs controlling the NUMA-aware clock-sweep behavior.
  *
@@ -70,6 +73,7 @@
  * clocksweep_scan_all_partitions - when enabled, looking for a free buffer
  * scans all clock-sweep partitions (in a round-robin way), not just the
  * backend's "home" partition.
+ *
  */
 bool		clocksweep_balance = true;
 bool		clocksweep_balance_recalc = true;
@@ -368,13 +372,29 @@ ClockSweepPartitionIndex(void)
 #ifdef USE_LIBNUMA
 	if (shared_buffers_numa)
 	{
-		int			cpu;
+		/*
+		 * Cache the CPU/NUMA node, refreshing only every CLOCKSWEEP_CPU_NODE_REFRESH
+		 * allocations. It appears that sched_getcpu()/numa_node_of_cpu() are not free.
+		 * On some platforms it take price of full system call, or the rest (x86_64?)
+		 * is can be use VDSO optimization. The backend rarely migrates between NUMA
+		 * nodes, and the balance logic only needs to notice migration after some time,
+		 * so an occasional refresh is good enough.
+		 */
+		static int	cached_node = -1;
+		static uint32 refresh_counter = 0;
+
+		if (cached_node < 0 || (refresh_counter++ % CLOCKSWEEP_CPU_NODE_REFRESH) == 0)
+		{
+			int			cpu;
 
-		/* XXX do we need to check sched_getcpu is available, somehow? */
-		if ((cpu = sched_getcpu()) < 0)
+			/* XXX do we need to check sched_getcpu is available, somehow? */
+			if ((cpu = sched_getcpu()) < 0)
 			elog(ERROR, "sched_getcpu failed: %m");
 
-		node = numa_node_of_cpu(cpu);
+			/* XXX/JW: use libnuma wrapper for this */
+			cached_node = numa_node_of_cpu(cpu);
+		}
+		node = cached_node;
 	}
 #endif
 
@@ -768,7 +788,8 @@ StrategySyncBalance(void)
 
 	uint32		total_allocs = 0,	/* total number of allocations */
 				avg_allocs,		/* average allocations (per partition) */
-				delta_allocs = 0;	/* sum of allocs above average */
+				delta_allocs = 0,	/* sum of allocs above average */
+				redirect_cutoff;	/* redirect only above this many allocs */
 
 	if (!clocksweep_balance || !clocksweep_balance_recalc)
 		return;
@@ -852,6 +873,20 @@ StrategySyncBalance(void)
 		return;
 	}
 
+	/*
+	 * A partition only redirects allocations to other partitions when it
+	 * exceeds the average by more than some threshold percent.
+	 * Below this cutoff we keep allocations local, to preserve NUMA locality.
+	 *
+	 * TODO: maybe better value is possible. On 4s with 25 I've got good results,
+	 * but with value of 50 I've got slight degradation. Maybe it should
+	 * be equal to 100/numa_nodes ?
+	 *
+	 */
+#define CLOCKSWEEP_CUTOFF_THRESHOLD 25
+	redirect_cutoff = avg_allocs +
+		(uint32) ((uint64) avg_allocs * CLOCKSWEEP_CUTOFF_THRESHOLD / 100);
+
 	/*
 	 * The actual rebalancing
 	 *
@@ -884,10 +919,15 @@ StrategySyncBalance(void)
 		/* reset the weights to start from scratch */
 		memset(balance, 0, sizeof(uint8) * MAX_BUFFER_PARTITIONS);
 
-		/* does this partition has fewer or more than avg_allocs? */
-		if (allocs[i] < avg_allocs)
+		/*
+		 * Does this partition exceed its fair share by more than the
+		 * threshold? If not, keep all allocations local - redirecting them
+		 * would push memory onto remote NUMA nodes for no real benefit when
+		 * the load is already close to balanced.
+		 */
+		if (allocs[i] <= redirect_cutoff)
 		{
-			/* fewer - don't redirect any allocations elsewhere */
+			/* near fair share (or below) - keep allocations local */
 			balance[i] = 100;
 		}
 		else
@@ -902,22 +942,45 @@ StrategySyncBalance(void)
 			/* fraction of the "total" delta */
 			double		delta_frac = (allocs[i] - avg_allocs) * 1.0 / delta_allocs;
 
-			/* keep just enough allocations to meet the target */
-			balance[i] = (100.0 * avg_allocs / allocs[i]);
+			/* how much we keep local; we hand out the rest below */
+			int			kept = 100;
 
 			/* redirect the extra allocations */
 			for (int j = 0; j < StrategyControl->num_partitions; j++)
 			{
 				/* How many allocations to receive from i-th partition? */
 				uint32		receive_allocs = delta_frac * (avg_allocs - allocs[j]);
+				int			w;
+
+				/* do not redirect to ourselves */
+				if (j == i)
+					continue;
 
 				/* ignore partitions that don't need additional allocations */
 				if (allocs[j] > avg_allocs)
 					continue;
 
+				/*
+				 * Only use other partitions that actually have demand of
+				 * their own (avoid idle). If we fail, there's always the
+				 * scan-all-partitions fallback.
+				 *
+				 * TODO:: just guessing,heuristics
+				 */
+				if (allocs[j] < (avg_allocs / 2))
+					continue;
+
 				/* fraction to redirect */
-				balance[j] = (100.0 * receive_allocs / allocs[i]) + 0.5;
+				w = (int) ((100.0 * receive_allocs / allocs[i]) + 0.5);
+				balance[j] = w;
+				kept -= w;
 			}
+
+			/* avoid negative balances */
+			if (kept > 0)
+				balance[i] = kept;
+			else
+				balance[i] = 1;
 		}
 
 		/* combine the old and new weights (hysteresis) */
-- 
2.43.0

Title: Performance Evaluation Matrix
Table 1: Performance Relative to Master Default (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master default (Baseline) | master interleave | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS100krows | 1 | 100.00% | 91.38% | 93.38% | 98.57% | 94.32% | 81.72% | 84.25% | 76.65% |
| pgbenchS100krows | 8 | 100.00% | 100.98% | 100.79% | 100.75% | 102.13% | 87.25% | 87.75% | 87.32% |
| pgbenchS100krows | 32 | 100.00% | 100.79% | 101.87% | 103.10% | 103.76% | 90.77% | 92.42% | 92.02% |
Table 2: Performance Relative to Master Interleave (100% Baseline)
Values within ±2% of the baseline are considered noise and are uncolored.
| Benchmark | Clients | master interleave (Baseline) | master default | optimized (numa=off, bal=off) | optimized (numa=on, bal=off) | optimized (numa=on, bal=on) | patched (numa=off, bal=off) | patched (numa=on, bal=off) | patched (numa=on, bal=on) |
|---|---|---|---|---|---|---|---|---|---|
| pgbenchS100krows | 1 | 100.00% | 109.43% | 102.18% | 107.86% | 103.22% | 89.43% | 92.19% | 83.87% |
| pgbenchS100krows | 8 | 100.00% | 99.03% | 99.80% | 99.77% | 101.14% | 86.40% | 86.90% | 86.47% |
| pgbenchS100krows | 32 | 100.00% | 99.22% | 101.07% | 102.30% | 102.95% | 90.06% | 91.69% | 91.30% |
