On Mon, Feb 9, 2026 at 7:11 PM Jakub Wartak <[email protected]> wrote:
>
> On Wed, Jan 28, 2026 at 2:19 PM Ashutosh Bapat
> <[email protected]> wrote:
> > v 20260128*.patch
>
> Short intro: I've started trying out these patches for a slightly different
> reason than the online buffer resize. There was a recent post [1], brought
> to my attention by Alvaro, complaining that the postmaster is unscalable,
> more or less saturating at 2-3k new connections/second and becoming a CPU
> hog (one could argue that's too many connections and not a sensible setup).
>
> I thought the main reason for the hit could be a slow fork(), so I had an
> idea: why do we fork() with the majority of memory being shared_buffers
> (BufferBlocks), which is not really used inside the postmaster itself
> (only backends use it)? I thought it would be cool if we could just init
> the memory, leave only the fd from memfd_create() for s_b around (that is,
> munmap() BufferBlocks from the postmaster, thus lowering its RSS/smaps
> footprint), so that fork() would NOT have to copy that big kernel VMA for
> shared_buffers, only the fd that references it. Thereby we could increase
> the scalability of the postmaster (the kernel would need to perform less
> work during fork()). Later on, the backends would mmap() the region back
> from the fd created earlier (in the postmaster) using memfd_create(2), but
> that would happen in many backends, so the workload would be spread across
> many CPUs. The critical assumption here is that although on Linux there
> seems to be huge PMD sharing for MAP_SHARED | MAP_HUGETLB, I was still
> wondering if we couldn't accelerate it further by simply not having this
> memory mapped at all before calling fork().
>
> Initially, a simple PoC bench on 64GB, even with hugepages, showed some
> potential:
> Scenario 1 (mmap inherited): 20001 total forks, 0.302ms per fork
> Scenario 2 (MADV_DONTFORK): 20001 total forks, 0.292ms per fork
> Scenario 3 (memfd_create): 20002 total forks, 0.145ms per fork
>
> Quite unexpectedly, that's how I discovered your and Dimitry's patch, as
> it already had separation of memory segments (rather than one big mmap()
> blob) and used memfd_create(2) too, so I just gave it a try. So I
> benchmarked your patchset when it comes to establishing new connections:
>
> 1s4c, 32GB RAM, 6.14.x kernel, 16GB shared_buffers
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 4 -c 100
>     -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> # master
> latency average = 358.681 ms
> latency stddev = 225.813 ms
> average connection time = 2.989 ms
> tps = 1329.733460 (including reconnection times)
>
> # memfd/this patchset
> latency average = 363.584 ms
> latency stddev = 230.529 ms
> average connection time = 3.022 ms
> tps = 1315.810761 (including reconnection times)
>
> # memfd+mytrick, showed some promise in low stddev, but not in TPS
> latency average = 34.229 ms
> latency stddev = 22.059 ms
> average connection time = 2.908 ms
> tps = 1369.785773 (including reconnection times)
>
> Another box: 4s32c64, 128GB RAM, 6.14.x kernel,
> 64GB shared_buffers (4 NUMA nodes)
>
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 128 -c 1000
>     -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> # master
> latency average = 240.179 ms
> latency stddev = 119.379 ms
> average connection time = 62.049 ms
> tps = 2058.434343 (including reconnection times)
>
> # memfd
> latency average = 268.384 ms
> latency stddev = 133.501 ms
> average connection time = 69.081 ms
> tps = 1847.422995 (including reconnection times)
>
> # memfd+mytrick
> latency average = 261.726 ms
> latency stddev = 130.161 ms
> average connection time = 67.579 ms
> tps = 1889.988400 (including reconnection times)
>
Thanks for the benchmarks. I can see
1. There isn't much impact from having multiple segments on new connection time.
2. fallocate seems to be behind the regression on the machine with 4 NUMA nodes.
Am I reading it correctly?
The latest patches 20260209 use only two segments. Please check if
that improves the situation further.
> So:
> a) yes, my idea fizzled - still no crystal clear idea why - but at least
> I've tried your patch :) We are still in the ballpark of ~1800..3000
> new connections per second.
>
> and here a proper review of the patchset follows:
> b) the patch changes the behavior on startup: it now appears to touch
> all the memory during startup, which takes much more time (I'm thinking
> of HA failover/promote scenarios where a long startup could mean
> trouble, e.g. after pg_rewind). E.g. without the patch startup takes
> 1-2s, and with the patch it takes 49s (no HugePages, 64GB s_b, slow
> machine). This happens due to the new fallocate() call in
> shmem_fallocate(). If it is supposed to stay like that, IMHO the server
> should elog() what it is doing ("allocating memory..."), otherwise
> users can be left confused. It almost behaves as if MAP_POPULATE were
> used.
>
> c) as per the above measurements, on NUMA there seems to be a
> regression to ~89% of baseline (1847/2058) when it comes to
> establishing new connections, and you are operating on sysv_shmem.c
> (so affecting all users). Possibly this should be re-tested on more
> modern hardware (I don't see it on a single socket, but I do see it
> on multiple sockets).
I have added a TODO in the code to investigate this case later as we
fine-tune the code.
>
> d) MADV_HUGEPAGE is Linux 4.14+ and, although that was released nearly
> 10 years ago, the buildfarm probably still has some animals (Ubuntu 16?)
> running such old kernels (??)
>
> e) so maybe because of b+c+d we should consider putting it under some new
> shared_memory_type in the long run?
That may be a good idea, so as to avoid hitting a segfault at run time
because of lack of memory backing the shared memory.
>
> f) With huge_pages=on and no asserts it never worked for me, due to:
> FATAL: segment[main]: could not truncate anonymous file to
> size 313483264: Invalid argument
> Please see this (this is with both(!)
> max_shared_buffers=shared_buffers=1GB);
> for some reason ftruncate() ended up being called with ~2x the size:
> [pid 1252287] memfd_create("main", MFD_HUGETLB) = 4
> [pid 1252287] mmap(NULL, 157286400, PROT_NONE, MAP_SHARED|MAP_NORESE..
> [pid 1252287] mprotect(0x7f2a1a400000, 157286400, PROT_READ|PROT_WRI..
> [pid 1252287] ftruncate(4, 313483264) = -1 EINVAL (Invalid argument)
> It appears that I'm getting this due to a bug in
> round_off_mapping_sizes_for_hugepages(): before it is called I get
> shmem_reserved=156196864, shmem_req_size=156196864
> and after it is called it returns
> shmem_reserved=157286400, shmem_req_size=313483264
> Maybe TYPEALIGN() would be a better fit there.
>
I see the bug. Fixed in the attached diff. Please apply it on top of
20260209 and let me know if it fixes the issue for you. I will include
it in the next set of patches.
--
Best Wishes,
Ashutosh Bapat
huge_page_fix.diff.no_ci
Description: Binary data
