On Tue, Feb 10, 2026 at 7:17 AM Ashutosh Bapat
<[email protected]> wrote:
>
> On Mon, Feb 9, 2026 at 7:11 PM Jakub Wartak
> <[email protected]> wrote:

> > 1s4c 32GB RAM, 6.14.x kernel, 16GB shared_buffers
> > benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 4 -c 100
> >     -f <(echo "SELECT 1;") postgres -P 1 -T 30
> >
> > # master
> >     latency average = 358.681 ms
> >     latency stddev = 225.813 ms
> >     average connection time = 2.989 ms
> >     tps = 1329.733460 (including reconnection times)
> >
> > # memfd/thispatchset
> >     latency average = 363.584 ms
> >     latency stddev = 230.529 ms
> >     average connection time = 3.022 ms
> >     tps = 1315.810761 (including reconnection times)
> >

> > Another box, 4s32c64, 128GB RAM, 6.14.x kernel,
> > 64GB shared_buffers (4 NUMA nodes)
> >
> > benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 128 -c 1000
> >     -f <(echo "SELECT 1;") postgres -P 1 -T 30
> >
> > #master
> >     latency average = 240.179 ms
> >     latency stddev = 119.379 ms
> >     average connection time = 62.049 ms
> >     tps = 2058.434343 (including reconnection times)
> >
> > #memfd
> >     latency average = 268.384 ms
> >     latency stddev = 133.501 ms
> >     average connection time = 69.081 ms
> >     tps = 1847.422995 (including reconnection times)

Hi Ashutosh!

>
> Thanks for the benchmarks. I can see

> 1. There isn't much impact of having multiple segments on new
> connection time.

> The latest patches 20260209 use only two segments. Please check if
> that improves the situation further.

Well, there was a regression (2058 -> 1847 conns/s) with the previous
patchset on this legacy NUMA box, but now it appears to be gone,
probably thanks to having just two regions in v20260209, as I'm
getting better results:
    latency average = 244.292 ms
    latency stddev = 121.505 ms
    average connection time = 62.973 ms
    tps = 2027.553831 (including reconnection times)

That box is legacy, slow, deprecated, and Andres hates it, but on it
it's sometimes easier to spot such things with the naked eye.

> 2. fallocate seems to be behind the regression on the machine with 4 NUMA nodes.

Yes, that fallocate() on startup can take a lot of time (without HPs
here, 32GB):
[pid 948989] 15:16:49 ftruncate(5, 34359746560) = 0
[pid 948989] 15:16:49 fallocate(5, 0, 0, 34359746560.....................) = 0
[pid 948989] 15:17:10 fallocate(6, 0, 0, 209776) = 0

I'm kind of wondering and worried about this behavior e.g. on
modern >=1TB RAM machines (so 256GB for s_b). If you shielded this
behind a non-default shared_memory_type, then probably not everyone
would be impacted, only those who prefer online resizing of shared
buffers.
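
For reference, that pattern boils down to something like the below
standalone sketch (my illustration only, not the patch's actual code;
the ~32GB size just mirrors the trace above):

    /* sketch: memfd_create + ftruncate + fallocate, as in the strace above */
    #define _GNU_SOURCE
    #include <fcntl.h>              /* fallocate() */
    #include <sys/mman.h>           /* memfd_create(), mmap() */
    #include <unistd.h>             /* ftruncate() */
    #include <stdio.h>

    int
    main(void)
    {
        size_t      sz = (size_t) 32 * 1024 * 1024 * 1024;     /* ~32GB */
        int         fd = memfd_create("main", 0);

        if (fd < 0)
        {
            perror("memfd_create");
            return 1;
        }
        /* cheap: only sets the file size */
        if (ftruncate(fd, sz) < 0)
        {
            perror("ftruncate");
            return 1;
        }
        /*
         * expensive: forces the kernel to actually back the whole file with
         * pages up front - this is the ~21s gap between the two timestamps
         * in the strace above
         */
        if (fallocate(fd, 0, 0, sz) < 0)
        {
            perror("fallocate");
            return 1;
        }
        if (mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        return 0;
    }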

> > and here a proper review of the patchset follows:
> > b) the patch changes the behavior on startup; it appears that the
> >    patch now tries to touch all the memory during startup, which takes
> >    much more time (I'm thinking of HA failover/promote scenarios where
> >    a long startup could mean trouble, e.g. after pg_rewind). E.g. without
> >    the patch it takes 1-2s and with the patch it takes 49s (no HugePages,
> >    64GB s_b, slow machine). It happens due to the new fallocate() from
> >    shmem_fallocate(). If it is supposed to stay like that, IMHO it should
> >    elog() what it is doing ("allocating memory..."), otherwise users can
> >    be left confused. It almost behaves as if MAP_POPULATE were used.
> >
> > c) as per the above measurements, on NUMA there seems to be a
> >    regression to ~89% of baseline (1847/2058) when it comes to
> >    establishing new connections, and you are operating on sysv_shmem.c
> >    (so affecting all users). Possibly this would have to be re-tested
> >    on some more modern hardware (I don't see it on a single socket,
> >    but I do see it with multiple sockets).
>
> I have added a TODO in the code to investigate this case later as we
> fine tune the code.

Well, with the new patch version I think you can remove it. I think it should
be solved, as per the above pgbench numbers (with v20260209 / just 2 segments)
and the numbers from the fork microbenchmark in my parallel reply to Andres.

> > d) MADV_HUGEPAGES is Linux 4.14+ and, although that was released nearly
> >    10 years ago, the buildfarm probably has some animals (Ubuntu 16?)
> >    that still use such old kernels (??)
> >
> > e) so maybe because of b+c+d we should consider putting it under some new
> >    shared_memory_type in the long run?
>
> That may be a good idea, so as to avoid hitting a segfault at run time
> because of a lack of memory to back the shared memory.

I'm not sure what you mean by that segfault, but right now it's just the
fallocate() that might cause a long startup time.

> > e) With huge_pages=on and no asserts it seemed to never work for me due to:
> >         FATAL:  segment[main]: could not truncate anonymous file to
> >             size 313483264:  Invalid argument
> >    and please see this (this is with both(!)
> >    max_shared_buffers=shared_buffers=1GB); for some reason
> >    ftruncate() ended up being called with ~2x the size:
> >         [pid 1252287] memfd_create("main", MFD_HUGETLB) = 4
> >         [pid 1252287] mmap(NULL, 157286400, PROT_NONE, MAP_SHARED|MAP_NORESE..
> >         [pid 1252287] mprotect(0x7f2a1a400000, 157286400, PROT_READ|PROT_WRI..
> >         [pid 1252287] ftruncate(4, 313483264)   = -1 EINVAL (Invalid argument)
> >    it appears that I'm getting this due to a bug in
> >    round_off_mapping_sizes_for_hugepages(), as before it I'm getting:
> >         shmem_reserved=156196864, shmem_req_size=156196864
> >    and after it is called it returns:
> >         shmem_reserved=157286400, shmem_req_size=313483264
> >    Maybe TYPEALIGN() would be a better fit there.
> >
>
> I see the bug. Fixed in the attached diff. Please apply it on top of
> 20260209 and let me know if it fixes the issue for you. I will include
> it in the next set of patches.

Yes, it fixes that "bug1", given
    shared_buffers = '32 GB'
    max_shared_buffers = '32 GB'
    max_connections = 1000
    huge_pages = 'on'

Without it, it was:
    mmap(NULL, 35399925760, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7fad34c00000
    mmap(NULL, 1038090240, PROT_NONE, MAP_SHARED|MAP_NORESERVE|MAP_HUGETLB, 4, 0) = 0x7facf6e00000
    ftruncate(4, 2074263552)                = -1 EINVAL (Invalid argument)

and with it:
    mmap(NULL, 1038090240, PROT_NONE, MAP_SHARED|MAP_NORESERVE|MAP_HUGETLB, 4, 0) = 0x7fbf49a00000
    ftruncate(4, 1038090240)                = 0
    mmap(NULL, 34361835520, PROT_NONE, MAP_SHARED|MAP_NORESERVE|MAP_HUGETLB, 5, 0) = 0x7fb749800000
    ftruncate(5, 34361835520)               = 0
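
(Side note: the TYPEALIGN()-style rounding I suggested earlier gives
exactly those sizes; a tiny standalone sketch, with the macro copied
from c.h and the input sizes taken from the traces in this mail:)

    /* sketch only: round segment sizes up to the 2MB huge page size */
    #include <stdio.h>
    #include <stdint.h>

    /* as in PostgreSQL's c.h; ALIGNVAL must be a power of 2 */
    #define TYPEALIGN(ALIGNVAL,LEN)  \
        (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

    int
    main(void)
    {
        uintptr_t   hp = 2UL * 1024 * 1024;     /* 2MB huge page */

        /* prints 1038090240 and 34361835520, matching the ftruncate() sizes */
        printf("%lu\n", (unsigned long) TYPEALIGN(hp, 1036173312UL));
        printf("%lu\n", (unsigned long) TYPEALIGN(hp, 34359746560UL));
        return 0;
    }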

So bug1 should be fixed. However there's something odd afterwards:

    postgres=# show huge_pages;
    huge_pages
    ------------
    on
    (1 row)
    postgres=# show huge_pages_status ;
    huge_pages_status
    -------------------
    off
    (1 row)

Cross-checking shows it is true, no HPs ended up being allocated:
    $ grep -A 2 /memfd /proc/775241/smaps # postmaster shows no HP usage (so just 4kB pages)
    7f845b7f1000-7f8c5b7f3000 rw-s 00000000 00:01 28682   /memfd:buffers (deleted)
    Size:           33554440 kB
    KernelPageSize:        4 kB
    --
    7f8c5b7f3000-7f8c9941f000 rw-s 00000000 00:01 28680   /memfd:main (deleted)
    Size:            1011888 kB
    KernelPageSize:        4 kB

strace shows a silent failure on startup:
    [pid 775320] mmap(NULL, 35399925760, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = -1 ENOMEM (Cannot allocate memory)
    [pid 775320] memfd_create("main", 0)    = 4
    [pid 775320] mmap(NULL, 1036173312, PROT_NONE, MAP_SHARED|MAP_NORESERVE, 4, 0) = 0x7fc3faff3000
    [pid 775320] ftruncate(4, 1036173312)   = 0
    [pid 775320] memfd_create("buffers", 0) = 5
    [pid 775320] mmap(NULL, 34359746560, PROT_NONE, MAP_SHARED|MAP_NORESERVE, 5, 0) = 0x7fbbfaff1000
    [pid 775320] ftruncate(5, 34359746560)  = 0

Further tracking nailed it down to bug2, a silent failure in
CreateSharedMemoryAndSemaphores()->PrepareHugePages().

That must be a logic error in the patch, because if I have
huge_pages=on I want it to fail to start instead of silently falling
back to off in huge_pages_status.
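
What I'd expect there is something along these lines (just a sketch of
the intent, with names/flow assumed rather than taken from the patch):

    /* sketch: if the MAP_HUGETLB probe fails and huge_pages=on, refuse to
     * start instead of silently falling back (names/flow assumed) */
    ptr = mmap(NULL, total_size, PROT_NONE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr == MAP_FAILED)
    {
        if (huge_pages == HUGE_PAGES_ON)
            ereport(FATAL,
                    (errcode(ERRCODE_OUT_OF_MEMORY),
                     errmsg("could not map anonymous shared memory with huge pages: %m")));

        /* huge_pages=try: fall back to regular pages */
        SetConfigOption("huge_pages_status", "off",
                        PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
    }
    else
        munmap(ptr, total_size);    /* probe only, unmap before the real mappings */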

And this shows another problem with calculating
shared_memory_size_in_huge_pages - I think it's wrong right now [bug3]. I
used postgres -C shared_memory_size_in_huge_pages and put that value into the
proper sysctl. It told me to use 16879 huge pages (times the 2MB huge page
size that is 33758MB = 35397828608 bytes), while mmap() wanted less
(34359746560, a ~989MB difference), and it still failed.

OK, sure, I'll throw in some more huge pages (17879) instead, but then it
still fails, with bug4 giving yet another strange error:
    2026-02-10 15:05:45.897 CET [775386] DEBUG:  reserving space: probe mmap(35399925760) with MAP_HUGETLB
    2026-02-10 15:05:45.897 CET [775386] DEBUG:  segment[main]: mmap(1038090240)
    2026-02-10 15:05:46.142 CET [775386] DEBUG:  segment[buffers]: mmap(34361835520)
    2026-02-10 15:05:46.388 CET [775386] FATAL:  segment[buffers]: could not allocate space for anonymous file: No space left on device

So this time it was:
    [pid 775426] memfd_create("main", MFD_HUGETLB) = 4
    [pid 775426] mmap(NULL, 1038090240, PROT_NONE, MAP_SHARED|MAP_NORESERVE|MAP_HUGETLB, 4, 0) = 0x7f8b6b200000
    [pid 775426] ftruncate(4, 1038090240)   = 0
    [pid 775426] fallocate(4, 0, 0, 1038090240) = 0
    [pid 775426] memfd_create("buffers", MFD_HUGETLB) = 5
    [pid 775426] mmap(NULL, 34361835520, PROT_NONE, MAP_SHARED|MAP_NORESERVE|MAP_HUGETLB, 5, 0) = 0x7f836b000000
    [pid 775426] ftruncate(5, 34361835520)  = 0
    [pid 775426] fallocate(5, 0, 0, 34361835520) = -1 ENOSPC (No space left on device)

34361835520 + 1038090240 = 35399925760 bytes total = 16880 huge pages, but I
had more HPs free than that (4 x 4750 = 19000 free 2MB pages across the
nodes):
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:4750
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:4750
    /sys/devices/system/node/node2/hugepages/hugepages-2048kB/free_hugepages:4750
    /sys/devices/system/node/node3/hugepages/hugepages-2048kB/free_hugepages:4750

I ran out of time to track down those HP bugs, just letting you know.

-J.

