On Mon, Feb 9, 2026 at 7:11 PM Jakub Wartak
<[email protected]> wrote:
>
> On Wed, Jan 28, 2026 at 2:19 PM Ashutosh Bapat
> <[email protected]> wrote:
>
> >v 20260128*.patch
>
> Short intro: I've started trying out these patches for a slightly
> different reason than the online buffers resize. There was a recent
> post [1] that Alvaro brought to our attention. That article complains
> about the postmaster being unscalable, more or less saturating at
> 2-3k new connections / second and becoming a CPU hog (one could argue
> that such a setup is excessive and not sensible).
>
> I thought the main cause of the hit might be a slow fork(), so I had
> an idea: why do we fork() with the majority of memory being
> shared_buffers (BufferBlocks), which is not really used inside the
> postmaster itself (it does not use it; only the backends do)? I
> thought it could be cool if we could just init the memory and keep
> only the fd from memfd_create() for s_b around (that is, munmap()
> BufferBlocks from the postmaster, thus lowering its RSS/smaps
> footprint), so that fork() would NOT have to copy that big kernel VMA
> for shared_buffers; in theory only the fd, which is the reference,
> would be inherited, thereby increasing the scalability of the
> postmaster (the kernel would need to perform less work during
> fork()). Later on, the classic backends would mmap() the region back
> from the fd created earlier (in the postmaster) using memfd_create(2),
> but that would happen inside many backends (so the work would be
> spread across many CPUs). The critical assumption here is that,
> although on Linux there seems to be huge PMD sharing for
> MAP_SHARED | MAP_HUGETLB, we might still be able to accelerate fork()
> further by simply not having this memory mapped at all before calling
> it. Initially I created a simple PoC benchmark on a 64GB machine;
> even with hugepages it showed some potential:
>     Scenario 1 (mmap inherited): 20001 total forks, 0.302ms per fork
>     Scenario 2 (MADV_DONTFORK): 20001 total forks, 0.292ms per fork
>     Scenario 3 (memfd_create): 20002 total forks, 0.145ms per fork
>
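> For reference, here is a minimal standalone sketch of the scenario 3
> idea (illustrative only, not code from the patchset; the 1GB size and
> all names are made up):
>
>     /*
>      * The parent creates the buffer pool as a memfd but does not map
>      * it, so fork() only inherits the fd and there is no large VMA
>      * for the kernel to copy; each child maps the region itself.
>      */
>     #define _GNU_SOURCE
>     #include <sys/mman.h>
>     #include <unistd.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     int
>     main(void)
>     {
>         size_t  sz = (size_t) 1 << 30;      /* stand-in for s_b */
>         int     fd = memfd_create("sb", 0); /* MFD_HUGETLB optional */
>
>         if (fd < 0 || ftruncate(fd, sz) < 0)
>         {
>             perror("memfd_create/ftruncate");
>             exit(1);
>         }
>
>         /* parent: no mmap() here, nothing big to copy at fork() */
>         if (fork() == 0)
>         {
>             /* child ("backend"): map the region from the inherited fd */
>             void   *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, fd, 0);
>
>             if (p == MAP_FAILED)
>             {
>                 perror("mmap");
>                 exit(1);
>             }
>             /* ... touch/use the shared buffer pool here ... */
>             exit(0);
>         }
>         return 0;
>     }
>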
> Quite unexpectedly, that's how I discovered your and Dimitry's patch,
> as it already had separate memory segments (rather than one big
> mmap() blob) and used memfd_create(2) too, so I just gave it a try.
> So I benchmarked your patchset with respect to establishing new
> connections:
>
> 1s4c 32GB RAM, 6.14.x kernel, 16GB shared_buffers
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 4 -c 100
>     -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> # master
>     latency average = 358.681 ms
>     latency stddev = 225.813 ms
>     average connection time = 2.989 ms
>     tps = 1329.733460 (including reconnection times)
>
> # memfd/thispatchset
>     latency average = 363.584 ms
>     latency stddev = 230.529 ms
>     average connection time = 3.022 ms
>     tps = 1315.810761 (including reconnection times)
>
> # memfd+mytrick: showed some promise in the lower stddev, but not in TPS
>     latency average = 34.229 ms
>     latency stddev = 22.059 ms
>     average connection time = 2.908 ms
>     tps = 1369.785773 (including reconnection times)
>
> Another box, 4s32c64, 128GB RAM, 6.14.x kernel,
> 64GB shared_buffers (4 NUMA nodes)
>
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 128 -c 1000
>     -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> #master
>     latency average = 240.179 ms
>     latency stddev = 119.379 ms
>     average connection time = 62.049 ms
>     tps = 2058.434343 (including reconnection times)
>
> #memfd
>     latency average = 268.384 ms
>     latency stddev = 133.501 ms
>     average connection time = 69.081 ms
>     tps = 1847.422995 (including reconnection times)
>
> #memfd+mytrick
>     latency average = 261.726 ms
>     latency stddev = 130.161 ms
>     average connection time = 67.579 ms
>     tps = 1889.988400 (including reconnection times)
>

Thanks for the benchmarks. I can see:
1. There isn't much impact of having multiple segments on new connection time.
2. fallocate seems to be behind the regression on the machine with 4 NUMA nodes.

Am I reading it correctly?

The latest patches 20260209 use only two segments. Please check if
that improves the situation further.

> So:
> a) yes, my idea fizzled - still no crystal clear idea why - but at least
>    I've tried Your's patch :) We are still in the ballpark of ~1800..3000
>    new connections per second.
>
> and here a proper review of the patchset follows:
> b) the patch changes the startup behavior: it now appears to touch
>    all the memory during startup, which takes much more time (I'm
>    thinking of HA failover/promote scenarios where a long startup
>    could mean trouble, e.g. after pg_rewind). E.g. without the patch
>    startup takes 1-2s, and with the patch it takes 49s (no HugePages,
>    64GB s_b, on a slow machine). This happens due to the new
>    fallocate() from shmem_fallocate(). If it is supposed to stay like
>    that, IMHO the server should elog() what it is doing ("allocating
>    memory..."), otherwise users can be left confused. It almost
>    behaves as if MAP_POPULATE were used.
>
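> If it stays, something along these lines (purely illustrative, not
> the patchset's code; fd and allocsize are placeholder names) would at
> least tell the user what is going on:
>
>     elog(LOG, "pre-allocating %zu bytes of shared memory, this may take a while",
>          (size_t) allocsize);
>     if (fallocate(fd, 0, 0, allocsize) < 0)
>         ereport(FATAL,
>                 (errcode_for_file_access(),
>                  errmsg("could not allocate shared memory: %m")));
>     elog(LOG, "shared memory pre-allocation done");
>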
> c) as per the above measurements, on NUMA there seems to be a
>    regression to about 1847/2058 = ~89% of baseline when it comes to
>    establishing new connections, and you are operating on
>    sysv_shmem.c (so affecting all users). Possibly this would have to
>    be re-tested on some more modern hardware (I don't see it on a
>    single socket, but I do see it on multiple sockets).

I have added a TODO in the code to investigate this case later as we
fine-tune the code.

>
> d) MADV_HUGEPAGES is Linux 4.14+, and although that was released
>    nearly 10 years ago, the buildfarm probably still has some animals
>    (Ubuntu 16?) that run such old kernels (?)
>
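> If that turns out to be a problem, a compile-time guard that treats
> the advice as best-effort would keep old kernels working (just a
> sketch, spelled here with MADV_HUGEPAGE; ptr and size are
> placeholders):
>
>     #ifdef MADV_HUGEPAGE
>         /* best effort: old kernels may reject the advice, just log it */
>         if (madvise(ptr, size, MADV_HUGEPAGE) < 0)
>             elog(DEBUG1, "madvise(MADV_HUGEPAGE) failed: %m");
>     #endif
>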
> e) so maybe because of b+c+d we should consider putting it under some new
>    shared_memory_type in the long run?

That may be a good idea, so as to avoid hitting a segfault at run time
because of a lack of memory to back the shared memory.

>
> f) With huge_pages=on and no asserts it never seemed to work for me, due to:
>         FATAL:  segment[main]: could not truncate anonymous file to
>             size 313483264:  Invalid argument
>    Please see this (this is with both(!)
>    max_shared_buffers=shared_buffers=1GB); for some reason ftruncate()
>    ended up being called with ~2x the size:
>         [pid 1252287] memfd_create("main", MFD_HUGETLB) = 4
>         [pid 1252287] mmap(NULL, 157286400, PROT_NONE, MAP_SHARED|MAP_NORESE..
>         [pid 1252287] mprotect(0x7f2a1a400000, 157286400, PROT_READ|PROT_WRI..
>         [pid 1252287] ftruncate(4, 313483264)   = -1 EINVAL (Invalid argument)
>    It appears that I'm getting this due to a bug in
>    round_off_mapping_sizes_for_hugepages(), as before it is called I'm
>    getting:
>         shmem_reserved=156196864, shmem_req_size=156196864
>    and after it is called it returns:
>         shmem_reserved=157286400, shmem_req_size=313483264
>    Maybe TYPEALIGN() would be a better fit there.
>
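> Something along these lines is what I have in mind (just a sketch of
> the idea; the variable names come from the debug output above, and
> hugepagesize stands for the huge page size in bytes, assumed to be a
> power of 2):
>
>     /* round the request up to the huge page size exactly once,
>      * rather than accumulating rounded and unrounded sizes */
>     shmem_req_size = TYPEALIGN(hugepagesize, shmem_req_size);
>     shmem_reserved = shmem_req_size;
>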

I see the bug. Fixed in the attached diff. Please apply it on top of
20260209 and let me know if it fixes the issue for you. I will include
it in the next set of patches.

-- 
Best Wishes,
Ashutosh Bapat

Attachment: huge_page_fix.diff.no_ci