Hi,

On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
> On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <[email protected]> wrote:
> >
> > On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > > `man madvise` has this
> > >         MADV_REMOVE (since Linux 2.6.16)
> > >                Free up a given range of pages and its associated
> > >                backing store.  This is equivalent to punching a hole
> > >                in the corresponding byte range of the backing store
> > >                (see fallocate(2)).  Subsequent accesses in the
> > >                specified address range will see bytes containing zero.
> > >
> > >                The specified address range must be mapped shared and
> > >                writable.  This flag cannot be applied to locked pages,
> > >                Huge TLB pages, or VM_PFNMAP pages.
> > >
> > >                In the initial implementation, only tmpfs(5) was
> > >                supported MADV_REMOVE; but since Linux 3.5, any
> > >                filesystem which supports the fallocate(2)
> > >                FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > >                Hugetlbfs fails with the error EINVAL and other
> > >                filesystems fail with the error EOPNOTSUPP.
> > >
> > > It says the flag can not be applied to Huge TLB pages. We won't be
> > > able to make resizable shared memory structures allocated with huge
> > > pages. That seems like a serious restriction.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html:
> >
> > MADV_REMOVE (since Linux 2.6.16)
> >                ...
> >
> >                Support for the Huge TLB filesystem was added in Linux
> >                v4.3.
> >
> > > I may be misunderstanding something, but it seems like this is useful
> > > to free already allocated memory, not necessarily allocate more
> > > memory. I don't understand how a user would start with a larger
> > > reserved address space with only small portions of that space being
> > > backed by memory.
> >
> > Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
> > to reserve address space for the maximum size, and then
> > madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> > madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> > again.
> 
> Thank you for the hint. Also thanks to Andres's idea, the resizable
> structure patch is quite small now. Actually, after experimenting with
> madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
> is not required at all. We don't have to do anything to expand a
> structure. Memory will be allocated as and when the program writes to
> it.

I think we *do* want MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when first touching the memory if no
huge page is available anymore.
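
A minimal sketch of that sequence, assuming a plain anonymous shared mapping
without MAP_HUGETLB (huge pages need system configuration that is not assumed
here). The helper names reserve_region() and populate_range() are hypothetical
and not taken from any patch. The point is only that populating up front
reports a failure through errno at resize time, instead of a signal at some
later access:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* MAP_NORESERVE, MADV_POPULATE_WRITE */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve max_size bytes of address space without committing memory.
 * Returns MAP_FAILED on error, like mmap() itself.
 */
static void *
reserve_region(size_t max_size)
{
	assert(max_size > 0);
	return mmap(NULL, max_size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/*
 * Fault in [addr, addr + len) right away.  addr must be page aligned.
 * Returns 0 on success, -1 with errno set on failure.  Where the headers
 * do not define MADV_POPULATE_WRITE (added in Linux 5.14), this sketch
 * just reports ENOSYS.
 */
static int
populate_range(void *addr, size_t len)
{
#ifdef MADV_POPULATE_WRITE
	return madvise(addr, len, MADV_POPULATE_WRITE);
#else
	(void) addr;
	(void) len;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would reserve the maximum size once at startup and call
populate_range() on the initial size, and again on each newly exposed range
when growing.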


> I also discovered things that I didn't know about.
> 1. ftruncate() sets the size of the file but it doesn't allocate the
> memory pages.

Right.


> 2. to use madvise() the address needs to be backed by a file, so
> memfd_create is a must.

I am quite sure that that is not true.  I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.
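
A standalone sketch of the same observation, outside of postgres. It assumes
Linux and a MAP_SHARED | MAP_ANONYMOUS mapping, with no memfd_create()
involved. The function name remove_zeroes_pages() is made up for illustration:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* MADV_REMOVE */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map four pages of anonymous shared memory, dirty them, then release the
 * last two with MADV_REMOVE.
 *
 * Returns 1 if the released pages read back as zero while the first page
 * kept its contents, 0 if not, and -1 if mmap() or madvise() failed.
 */
static int
remove_zeroes_pages(void)
{
	long		ps_raw = sysconf(_SC_PAGESIZE);
	size_t		ps;
	size_t		len;
	unsigned char *p;
	int			result;

	assert(ps_raw > 0);
	ps = (size_t) ps_raw;
	len = 4 * ps;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, len);

	/* Address and length are page aligned, as madvise() requires. */
	if (madvise(p + 2 * ps, 2 * ps, MADV_REMOVE) != 0)
		result = -1;
	else
		result = (p[0] == 0xAB && p[2 * ps] == 0 && p[len - 1] == 0);

	munmap(p, len);
	return result;
}
```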

What made you conclude that that is the case?


> 4. the address and length passed to madvise needs to be page aligned,
> but that passed to fallocate() needn't be. `man fallocate` says
> "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> range starting at offset and continuing for len bytes. Within the
> specified range, partial filesystem blocks are zeroed, and whole
> filesystem blocks are removed from the file.". It seems to be
> automatically taking care of the page size. So using fallocate()
> simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
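
For illustration, page-granular resizing only needs a little rounding
arithmetic. The sketch below uses hypothetical names (PageRange,
round_up_to_page(), releasable_range()) and assumes the page size is a power
of two. When shrinking, only whole pages that lie entirely beyond the new size
can be released:

```c
#include <assert.h>
#include <stddef.h>

/* Byte range, relative to the start of the structure, that can be freed. */
typedef struct PageRange
{
	size_t		offset;
	size_t		length;
} PageRange;

/* Round n up to the next multiple of page_size (a power of two). */
static size_t
round_up_to_page(size_t n, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (n + page_size - 1) & ~(page_size - 1);
}

/*
 * Pages that become unused when shrinking from old_size to new_size.
 * The page containing the last byte of new_size stays allocated, so the
 * range starts at the rounded-up new size.  length is 0 when both sizes
 * end in the same page (or when not shrinking at all).
 */
static PageRange
releasable_range(size_t old_size, size_t new_size, size_t page_size)
{
	PageRange	r;
	size_t		old_end = round_up_to_page(old_size, page_size);
	size_t		new_end = round_up_to_page(new_size, page_size);

	r.offset = new_end;
	r.length = old_end > new_end ? old_end - new_end : 0;
	return r;
}
```

With 4096-byte pages, shrinking from 10000 to 5000 bytes releases one page at
offset 8192, while shrinking from 5000 to 4200 bytes releases nothing.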


> Using fallocate() (or madvise()) to free memory, we don't need
> multiple segments. So much less code churn compared to the multiple
> mappings approach. However, there is one drawback. In the multiple
> mapping approach access beyond the current size of the structure would
> result in segfault or bus error. But in the fallocate/madvise approach
> such an access does not cause a crash. A write beyond the pages that
> fit the current size of the structure causes more memory to be
> allocated silently. A read returns 0s. So, there's a possibility that
> bugs in size calculations might go unnoticed. I think that's how it
> works even today, access in the yet un-allocated part of the shared
> memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.
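
A rough sketch of what that could look like, with hypothetical helper names
(set_usable_size(), read_faults()) and assuming all sizes are multiples of the
page size. read_faults() exists only to demonstrate the effect by probing the
address in a forked child:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Make [base, base + usable) readable and writable, and make the rest of
 * the reservation up to max_size inaccessible.  Returns 0 on success,
 * -1 with errno set on failure.
 */
static int
set_usable_size(char *base, size_t max_size, size_t usable)
{
	size_t		ps = (size_t) sysconf(_SC_PAGESIZE);

	assert(usable % ps == 0 && max_size % ps == 0);
	if (usable > max_size)
	{
		errno = EINVAL;
		return -1;
	}
	if (usable > 0 &&
		mprotect(base, usable, PROT_READ | PROT_WRITE) != 0)
		return -1;
	if (usable < max_size &&
		mprotect(base + usable, max_size - usable, PROT_NONE) != 0)
		return -1;
	return 0;
}

/*
 * Demonstration helper: returns 1 if reading *addr kills a forked child
 * with SIGSEGV, 0 if the read succeeds, -1 if fork()/waitpid() failed.
 */
static int
read_faults(const volatile char *addr)
{
	pid_t		pid = fork();
	int			status;

	if (pid < 0)
		return -1;
	if (pid == 0)
	{
		volatile char c = *addr;	/* faults if the page is PROT_NONE */

		(void) c;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Growing the structure would then be a set_usable_size() call with the larger
size, and shrinking would pair it with the madvise(MADV_REMOVE) on the range
being given up.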

Greetings,

Andres Freund

