Hi,

On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
> On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <[email protected]> wrote:
> >
> > On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > > `man madvise` has this
> > > MADV_REMOVE (since Linux 2.6.16)
> > > Free up a given range of pages and its associated
> > > backing store. This is equivalent to punching a
> > > hole in the corresponding byte range of the backing
> > > store (see fallocate(2)). Subsequent accesses
> > > in the specified address range will see bytes containing
> > > zero.
> > >
> > > The specified address range must be mapped shared
> > > and writable. This flag cannot be applied to
> > > locked pages, Huge TLB pages, or VM_PFNMAP pages.
> > >
> > > In the initial implementation, only tmpfs(5) was
> > > supported MADV_REMOVE; but since Linux 3.5, any
> > > filesystem which supports the fallocate(2)
> > > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > > Hugetlbfs fails with the error EINVAL and other
> > > filesystems fail with the error EOPNOTSUPP.
> > >
> > > It says the flag can not be applied to Huge TLB pages. We won't be
> > > able to make resizable shared memory structures allocated with huge
> > > pages. That seems like a serious restriction.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html:
> >
> > MADV_REMOVE (since Linux 2.6.16)
> > ...
> >
> > Support for the Huge TLB filesystem was added in Linux
> > v4.3.
> >
> > > I may be misunderstanding something, but it seems like this is useful
> > > to free already allocated memory, not necessarily allocate more
> > > memory. I don't understand how a user would start with a larger
> > > reserved address space with only small portions of that space being
> > > backed by memory.
> >
> > Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call.
> > to reserve address space for the maximum size, and then
> > madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> > madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> > again.
>
> Thank you for the hint. Also thanks to Andres's idea, the resizable
> structure patch is quite small now. Actually, after experimenting with
> madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
> is not required at all. We don't have to do anything to expand a
> structure. Memory will be allocated as and when the program writes to
> it.

I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.

> I also discovered things that I didn't know about.
> 1. ftruncate() sets the size of the file but it doesn't allocate the
> memory pages.

Right.

> 2. to use madvise() the address needs to be backed by a file, so
> memfd_create is a must.

I am quite sure that that is not true. I hacked this up with today's postgres,
and the madvise works with the mmap() backed allocation from sysv_shmem.c,
which is anonymous.

What made you conclude that that is the case?

> 4. the address and length passed to madvise needs to be page aligned,
> but that passed to fallocate() needn't be. `man fallocate` says
> "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> range starting at offset and continuing for len bytes. Within the
> specified range, partial filesystem blocks are zeroed, and whole
> filesystem blocks are removed from the file.". It seems to be
> automatically taking care of the page size. So using fallocate()
> simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size granularity.
What's the point of doing that?

> Using fallocate() (or madvise()) to free memory, we don't need
> multiple segments. So much less code churn compared to the multiple
> mappings approach. However, there is one drawback. In the multiple
> mapping approach access beyond the current size of the structure would
> result in segfault or bus error. But in the fallocate/madvise approach
> such an access does not cause a crash. A write beyond the pages that
> fit the current size of the structure causes more memory to be
> allocated silently. A read returns 0s. So, there's a possibility that
> bugs in size calculations might go unnoticed. I think that's how it
> works even today, access in the yet un-allocated part of the shared
> memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

Greetings,

Andres Freund
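
To make the scheme discussed in this thread concrete, here is a minimal
sketch of reserving address space up front, growing with mprotect() plus
MADV_POPULATE_WRITE, and shrinking with MADV_REMOVE plus
mprotect(PROT_NONE). It is not PostgreSQL code: reserve_region(),
grow_region(), shrink_region() and page_aligned() are hypothetical names, it
assumes Linux with regular (non-huge) pages and an anonymous shared mapping,
and treating EINVAL from MADV_POPULATE_WRITE as "kernel too old" is an
assumption made for the sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* All sizes passed to the functions below must be page-size multiples. */
static int
page_aligned(size_t n)
{
    return n % (size_t) sysconf(_SC_PAGESIZE) == 0;
}

/*
 * Reserve address space for the maximum size. PROT_NONE makes every access
 * fault until grow_region() opens a range up; MAP_NORESERVE avoids reserving
 * memory that may never be used.
 */
static void *
reserve_region(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_NONE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Make [old_size, new_size) usable and try to allocate it right away. */
static int
grow_region(void *base, size_t old_size, size_t new_size)
{
    char   *start = (char *) base + old_size;
    size_t  len = new_size - old_size;

    assert(page_aligned(old_size) && page_aligned(new_size));
    assert(new_size > old_size);

    if (mprotect(start, len, PROT_READ | PROT_WRITE) != 0)
        return -1;
#ifdef MADV_POPULATE_WRITE
    /*
     * Populating now reports an allocation failure here instead of as a
     * SIGBUS on first access. Assumption: EINVAL means the kernel predates
     * MADV_POPULATE_WRITE (Linux 5.14), so fall back to lazy allocation.
     */
    if (madvise(start, len, MADV_POPULATE_WRITE) != 0 && errno != EINVAL)
        return -1;
#endif
    return 0;
}

/* Free the memory behind [new_size, old_size) and fence the range off. */
static int
shrink_region(void *base, size_t old_size, size_t new_size)
{
    char   *start = (char *) base + new_size;
    size_t  len = old_size - new_size;

    assert(page_aligned(old_size) && page_aligned(new_size));
    assert(old_size > new_size);

    /* MADV_REMOVE needs a shared, writable range, so do it before PROT_NONE. */
    if (madvise(start, len, MADV_REMOVE) != 0)
        return -1;
    return mprotect(start, len, PROT_NONE);
}
```

A caller would reserve once at startup, e.g. reserve_region(max_size), then
call grow_region(base, 0, initial_size), and later use shrink_region() and
grow_region() with page-aligned sizes. After a shrink, stray accesses past
the current size fault instead of silently allocating memory, and a range
that is grown again reads back as zeros.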
