Re: Better shared data structure management and resizable shared data structures

Ashutosh Bapat Wed, 18 Feb 2026 07:51:25 -0800

On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<[email protected]> wrote:
>
> On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
> <[email protected]> wrote:
> >
> > On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <[email protected]> wrote:
> > >
> > > I think we *do* want the MADV_POPULATE_WRITE, at least when using huge
> > > pages, because otherwise you'll get a SIGBUS when accessing the memory
> > > if there is no huge page available anymore.
> > >
> >
> > Ok.
> >
> > Jakub's experiments [1] showed that fallocate()ing shared memory would
> > slow down postmaster start on a slow machine. I suppose the same thing
> > applies to MADV_POPULATE_WRITE. And we don't do that today even in the
> > case of huge pages; so we already have that problem.
> >
> > If we perform MADV_POPULATE_WRITE, do we want it only for resizable
> > shared memory structures or all the structures in the shared memory?
>
> In the attached patches, I have used MADV_POPULATE_WRITE during
> resizing, which is a run-time operation. Structures allocated at
> server start are usually initialised right away, which already
> allocates their memory. So we don't need MADV_POPULATE_WRITE at that
> time, and thus avoid slowing down startup. Buffer blocks are not
> initialised at the time of starting the server, so their memory is
> allocated as they are accessed. But that's how it works today, so no
> change there.
>
> >
> >
> > >
> > > > 4. the address and length passed to madvise needs to be page aligned,
> > > > but that passed to fallocate() needn't be. `man fallocate` says
> > > > "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> > > > 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> > > > range starting at offset and continuing for len bytes. Within the
> > > > specified range, partial filesystem blocks are zeroed, and whole
> > > > filesystem blocks are removed from the file.". It seems to be
> > > > automatically taking care of the page size. So using fallocate()
> > > > simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> > > > filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> > > > also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> > > > guaranteed to be available on a system which supports MADV_REMOVE.
> > >
> > > I think it makes no sense to support resizing below page size
> > > granularity. What's the point of doing that?
> > >
> >
> > No point really. But we can not control the extensions which want to
> > specify a maximum size smaller than a page size. They wouldn't know
> > what page size the underlying machine will have, especially with huge
> > pages which have a wide range of sizes. Even in the case of shared
> > buffers, a value of max_shared_buffers may cause buffer blocks to span
> > pages but other structures may fit a page.
> >
> > In the attached patches, if a resizable structure is such that its
> > max_size is smaller than a page size, it is treated as a fixed
> > structure with size = max_size. Any request to resize such structures
> > will simply update the metadata without actual madvise operation. Only
> > the structures whose max_size > page_size would be treated as truly
> > resizable and will use madvise. You bring another interesting point.
> > If a resizable structure has a maximum size higher than the page size,
> > but it is allocated such that the initial part of it is on a partially
> > allocated page and the last part of it is on another partially
> > allocated page, those pages are never freed because of adjoining
> > structures. Per the logic in the attached patches, all the fixed (or
> > pseudo-resizable) structures are packed together. The resizable
> > structures start on a page boundary and their max_sizes are adjusted
> > to be page aligned. That way we can release pages when the structure
> > shrinks more than a page.
> >
>
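
To make the alignment rule above concrete, here is a rough sketch of how
the releasable range could be computed when a page-aligned structure
shrinks. This is not code from the attached patches: align_up() and
releasable_range() are hypothetical names, offsets are relative to the
(page-aligned) start of the structure, and page_size is assumed to be a
power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of page_size (a power of two). */
static size_t
align_up(size_t value, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (value + page_size - 1) & ~(page_size - 1);
}

/*
 * Hypothetical helper: for a structure that starts on a page boundary,
 * compute the whole pages that become unused when it shrinks from
 * old_size to new_size.  On success returns 1 and sets *start and *len
 * (byte offsets from the start of the structure); returns 0 when the
 * shrink does not free at least one full page.
 */
static int
releasable_range(size_t old_size, size_t new_size, size_t page_size,
                 size_t *start, size_t *len)
{
    size_t keep_end = align_up(new_size, page_size);
    size_t old_end = align_up(old_size, page_size);

    if (new_size >= old_size || keep_end >= old_end)
        return 0;               /* growing, or still within the same page */

    *start = keep_end;
    *len = old_end - keep_end;
    return 1;
}
```

With 4096-byte pages, shrinking from 10000 to 5000 bytes gives start =
8192 and len = 4096, while shrinking from 10000 to 9000 bytes releases
nothing because both sizes end in the same page.
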
> > >
> > > > Using fallocate() (or madvise()) to free memory, we don't need
> > > > multiple segments. So much less code churn compared to the multiple
> > > > mappings approach. However, there is one drawback. In the multiple
> > > > mapping approach access beyond the current size of the structure would
> > > > result in segfault or bus error. But in the fallocate/madvise approach
> > > > such an access does not cause a crash. A write beyond the pages that
> > > > fit the current size of the structure causes more memory to be
> > > > allocated silently. A read returns 0s. So, there's a possibility that
> > > > bugs in size calculations might go unnoticed. I think that's how it
> > > > works even today, access in the yet un-allocated part of the shared
> > > > memory will simply go unnoticed.
> > >
> > > If that's something you care about, you can mprotect(PROT_NONE) the
> > > relevant regions.
> >
> > I am fine, if we let go of this protection while getting rid of
> > multiple segments, if we all agree to do so.
> >
> > I could be wrong, but mprotect needs to be executed in every backend
> > where the memory is mapped and then a new backend needs to inherit it
> > from the postmaster. Makes resizing complex since it has to touch
> > every backend. So avoiding mprotect is better.
> >
>


Sent too soon.

I have also reworked the test into a TAP test, which looks more stable
than the earlier version. I haven't had any failures on my laptop.

> If the general approach in the attached patches looks good, we can
> work on improving the 0001 + 0002 to be committable and then work on
> 0003.

The resizable memory patch works only on Linux, where
MADV_POPULATE_WRITE and MADV_REMOVE are supported on anonymous shared
memory. On other platforms, and wherever that support doesn't exist, we
will need to disable the feature for now. That work remains. Also, the
TODOs need to be addressed.
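
For the platform check, something along the following lines might serve
as a starting point. It is only a sketch and not part of the attached
patches: resizable_shmem_supported() is a hypothetical name, and the
fallback value 23 for MADV_POPULATE_WRITE is taken from the Linux uapi
headers for builds against older libc headers. It maps a few pages of
anonymous shared memory, populates them, and then checks that
MADV_REMOVE really hands the pages back (a subsequent read sees zeros).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23  /* Linux 5.14+; assumed from uapi headers */
#endif

/*
 * Hypothetical probe: returns 1 if MADV_POPULATE_WRITE and MADV_REMOVE
 * both work on anonymous shared memory, 0 otherwise.
 */
static int
resizable_shmem_supported(void)
{
    long    pagesz = sysconf(_SC_PAGESIZE);
    size_t  len;
    char   *p;
    int     ok = 0;

    assert(pagesz > 0);
    len = (size_t) pagesz * 4;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    /* Fails with EINVAL on kernels that lack MADV_POPULATE_WRITE. */
    if (madvise(p, len, MADV_POPULATE_WRITE) == 0)
    {
        p[0] = 1;
        /* After a successful MADV_REMOVE the page must read as zero. */
        if (madvise(p, len, MADV_REMOVE) == 0 && p[0] == 0)
            ok = 1;
    }

    munmap(p, len);
    return ok;
}
```

A run-time probe like this would also cover kernels where the constants
exist in the headers but the running kernel rejects them. Whether the
check belongs in configure, at postmaster start, or both is open.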

-- 
Best Wishes,
Ashutosh Bapat
