Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> >   push the end of the .text segment over the next aligned boundary, or to
> >   ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?
I now assume it's because you observed that the mappings set up by the loader
do not include the space between the segments? With sufficient linker flags
the segments are sufficiently aligned both on disk and in memory to just map
more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000000c7f58 0x00000000000c7f58 R      0x200000
  LOAD  0x0000000000200000 0x0000000000200000 0x0000000000200000 0x0000000000921d39 0x0000000000921d39 R E    0x200000
  LOAD  0x0000000000c00000 0x0000000000c00000 0x0000000000c00000 0x00000000002626b8 0x00000000002626b8 R      0x200000
  LOAD  0x0000000000fdf510 0x00000000011df510 0x00000000011df510 0x0000000000037fd6 0x000000000006a310 RW     0x200000

gold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000009230f9 0x00000000009230f9 R E    0x200000
  LOAD  0x0000000000a00000 0x0000000000a00000 0x0000000000a00000 0x000000000033a738 0x000000000033a738 R      0x200000
  LOAD  0x0000000000ddf4e0 0x0000000000fdf4e0 0x0000000000fdf4e0 0x000000000003800a 0x000000000006a340 RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments

  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x000000000033710c 0x000000000033710c R      0x200000
  LOAD  0x0000000000400000 0x0000000000400000 0x0000000000400000 0x0000000000921cb0 0x0000000000921cb0 R E    0x200000
  LOAD  0x0000000000e00000 0x0000000000e00000 0x0000000000e00000 0x0000000000020ae0 0x0000000000020ae0 RW     0x200000
  LOAD  0x0000000001000000 0x0000000001000000 0x0000000001000000 0x00000000000174ea 0x0000000000049820 RW     0x200000

mold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x000000000032dde9 0x000000000032dde9 R      0x200000
  LOAD  0x0000000000400000 0x0000000000400000 0x0000000000400000 0x0000000000921cbe 0x0000000000921cbe R E    0x200000
  LOAD  0x0000000000e00000 0x0000000000e00000 0x0000000000e00000 0x00000000002174e8 0x0000000000249820 RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However, the OS / dynamic loader only
maps the necessary part, not all the zero padding. This means that if we were
to issue a MADV_COLLAPSE, we can first do an mremap() to increase the length
of the mapping.

MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070      cycles                                        (55.53%)
   729,146,640,019      instructions      #  0.62 insn per cycle      (66.65%)
         7,062,923      itlb.itlb_flush                               (66.65%)
     1,041,825,587      iTLB-loads                                    (66.65%)
       634,272,420      iTLB-load-misses  # 60.88% of all iTLB cache accesses  (66.66%)
    27,018,254,873      itlb_misses.walk_active                       (66.68%)
       610,639,252      itlb_misses.walk_completed_4k                 (44.47%)
        24,262,549      itlb_misses.walk_completed_2m_4m              (44.46%)
             2,948      itlb_misses.walk_completed_1g                 (44.43%)

      10.039217004 seconds time elapsed

MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934      cycles                                        (55.53%)
   746,008,850,147      instructions      #  0.64 insn per cycle      (66.65%)
         7,538,962      itlb.itlb_flush                               (66.65%)
       799,861,088      iTLB-loads                                    (66.65%)
       254,347,048      iTLB-load-misses  # 31.80% of all iTLB cache accesses  (66.66%)
    14,427,296,885      itlb_misses.walk_active                       (66.69%)
       221,811,835      itlb_misses.walk_completed_4k                 (44.47%)
        32,881,405      itlb_misses.walk_completed_2m_4m              (44.46%)
             3,043      itlb_misses.walk_completed_1g                 (44.43%)

      10.038517778 seconds time elapsed

compared to a run without any huge pages (via THP or MADV_COLLAPSE):

tps = 1034960.102843 (without initial connection time)

 Performance counter stats for 'system wide':

 1,183,743,785,066      cycles                                        (55.54%)
   678,525,810,443      instructions      #  0.57 insn per cycle      (66.65%)
         7,163,304      itlb.itlb_flush                               (66.65%)
     2,952,660,798      iTLB-loads                                    (66.65%)
     2,105,431,590      iTLB-load-misses  # 71.31% of all iTLB cache accesses  (66.66%)
    80,593,535,910      itlb_misses.walk_active                       (66.68%)
     2,105,377,810      itlb_misses.walk_completed_4k                 (44.46%)
         1,254,156      itlb_misses.walk_completed_2m_4m              (44.46%)
             3,366      itlb_misses.walk_completed_1g                 (44.44%)

      10.039821650 seconds time elapsed

So a 7.96% win from no-huge-pages to MADV_COLLAPSE, and a further 2.11% win
from there to also using mremap(), yielding a total of 10.23%. It's similar
across runs.

On my system the other libraries unfortunately aren't aligned properly. It'd
be nice to also remap at least libc.

The majority of the remaining misses are from the vdso (too small for a huge
page), libc (not aligned properly), returning from system calls (which flush
the itlb) and pgbench / libpq (I didn't add the mremap there, there's not
enough code for a huge page without it).

Greetings,

Andres Freund