Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <and...@anarazel.de> wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> > the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on fs one needs to make sure that the executable
> > doesn't reflinks to reuse parts of other files, and that the mold linker and
> > cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> > binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.

> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > > push the end of the .text segment over the next aligned boundary, or to
> > > > ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are aligned to
> > > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > > relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though. I don't
think the dummy functions are a good approach, there were plenty of things
after it when I played with them.

> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.

> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment. Here's my hacky code, with a bunch of
comments added.

    /* Needs _GNU_SOURCE (for mremap), <sys/mman.h>, <stdint.h>, <stdio.h>. */
    void *addr = (void *) 0x555555800000;
    void *end = (void *) 0x555555e09000;
    size_t advlen = (uintptr_t) end - (uintptr_t) addr;

    /* Round the length up to the next 2MiB boundary. */
    const size_t bound = 1024*1024*2;
    size_t advlen_up = (advlen + bound - 1) & ~(bound - 1);
    void *r2;
    int r;

    /*
     * Increase size of mapping to cover the trailing padding to the next
     * segment. Otherwise all the code in that range can't be put into
     * a huge page (access in the non-mapped range needs to cause a fault,
     * hence can't be in the huge page).
     * XXX: Should probably assert that that space is actually zeroes.
     */
    r2 = mremap(addr, advlen, advlen_up, 0);
    if (r2 == MAP_FAILED)
        fprintf(stderr, "mremap failed: %m\n");
    else if (r2 != addr)
        fprintf(stderr, "mremap wrong addr: %m\n");
    else
        advlen = advlen_up;

    /*
     * The docs for MADV_COLLAPSE say there should be at least one page
     * in the mapped space "for every eligible hugepage-aligned/sized
     * region to be collapsed". I just forced that. But probably not
     * necessary.
     */
    r = madvise(addr, advlen, MADV_WILLNEED);
    if (r != 0)
        fprintf(stderr, "MADV_WILLNEED failed: %m\n");

    r = madvise(addr, advlen, MADV_POPULATE_READ);
    if (r != 0)
        fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

    /*
     * Make huge pages out of it. Requires at least linux 6.1. We could
     * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
     * much in older kernels.
     */
    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25
    #endif
    r = madvise(addr, advlen, MADV_COLLAPSE);
    if (r != 0)
        fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're suitably
aligned (both in memory and on-disk).

Greetings,

Andres Freund
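
PS: For the /proc/self/maps part, something along these lines might be a
starting point. This is an untested sketch, not what I ran. The names
round_up_2m(), parse_rx_mapping(), foreach_aligned_rx_mapping() and
print_mapping() are made up for illustration, and it assumes the usual
"start-end perms ..." line format. It does not filter by pathname, so it
would also visit suitably aligned library mappings.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HUGE_2M ((size_t) 2 * 1024 * 1024)

/* Round len up to the next multiple of 2MiB. */
static size_t
round_up_2m(size_t len)
{
    return (len + HUGE_2M - 1) & ~(HUGE_2M - 1);
}

/*
 * Parse one line of /proc/<pid>/maps. Returns 1 and fills *start / *end
 * if the line describes an r-xp mapping, 0 otherwise.
 */
static int
parse_rx_mapping(const char *line, uintptr_t *start, uintptr_t *end)
{
    uint64_t s, e;
    char perms[5];

    if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s", &s, &e, perms) != 3)
        return 0;
    if (strcmp(perms, "r-xp") != 0)
        return 0;
    *start = (uintptr_t) s;
    *end = (uintptr_t) e;
    return 1;
}

/* Hypothetical callback type: called once per candidate mapping. */
typedef void (*rx_mapping_cb) (uintptr_t start, uintptr_t end, void *arg);

/*
 * Call cb for every r-xp mapping of this process that starts on a 2MiB
 * boundary. Returns the number of mappings visited, or -1 if
 * /proc/self/maps can't be opened.
 */
static int
foreach_aligned_rx_mapping(rx_mapping_cb cb, void *arg)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[4096];
    int n = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        uintptr_t start, end;

        if (!parse_rx_mapping(line, &start, &end))
            continue;
        if (start % HUGE_2M != 0)
            continue;           /* not hugepage aligned, skip */
        cb(start, end, arg);
        n++;
    }
    fclose(f);
    return n;
}

/* Example callback: only reports what would be collapsed. */
static void
print_mapping(uintptr_t start, uintptr_t end, void *arg)
{
    (void) arg;
    fprintf(stderr, "r-xp %" PRIxPTR "-%" PRIxPTR ", padded len %zu\n",
            start, end, round_up_2m((size_t) (end - start)));
}
```

A real callback would run the mremap() / madvise() sequence from above on
each range instead of printing it. Since that changes the mappings while the
file is still being read, it might be safer to collect the ranges first and
act on them after fclose().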