Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> >   push the end of the .text segment over the next aligned boundary, or to
> >   ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?
I now assume it's because you observed that the mappings set up by the loader
do not include the space between the segments? With sufficient linker flags
the segments are sufficiently aligned both on disk and in memory to just map
more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000000c7f58 0x00000000000c7f58 R      0x200000
  LOAD  0x0000000000200000 0x0000000000200000 0x0000000000200000 0x0000000000921d39 0x0000000000921d39 R E    0x200000
  LOAD  0x0000000000c00000 0x0000000000c00000 0x0000000000c00000 0x00000000002626b8 0x00000000002626b8 R      0x200000
  LOAD  0x0000000000fdf510 0x00000000011df510 0x00000000011df510 0x0000000000037fd6 0x000000000006a310 RW     0x200000

gold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000009230f9 0x00000000009230f9 R E    0x200000
  LOAD  0x0000000000a00000 0x0000000000a00000 0x0000000000a00000 0x000000000033a738 0x000000000033a738 R      0x200000
  LOAD  0x0000000000ddf4e0 0x0000000000fdf4e0 0x0000000000fdf4e0 0x000000000003800a 0x000000000006a340 RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments

  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x000000000033710c 0x000000000033710c R      0x200000
  LOAD  0x0000000000400000 0x0000000000400000 0x0000000000400000 0x0000000000921cb0 0x0000000000921cb0 R E    0x200000
  LOAD  0x0000000000e00000 0x0000000000e00000 0x0000000000e00000 0x0000000000020ae0 0x0000000000020ae0 RW     0x200000
  LOAD  0x0000000001000000 0x0000000001000000 0x0000000001000000 0x00000000000174ea 0x0000000000049820 RW     0x200000

mold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments

  Type  Offset             VirtAddr           PhysAddr           FileSiz            MemSiz             Flags  Align
  ...
  LOAD  0x0000000000000000 0x0000000000000000 0x0000000000000000 0x000000000032dde9 0x000000000032dde9 R      0x200000
  LOAD  0x0000000000400000 0x0000000000400000 0x0000000000400000 0x0000000000921cbe 0x0000000000921cbe R E    0x200000
  LOAD  0x0000000000e00000 0x0000000000e00000 0x0000000000e00000 0x00000000002174e8 0x0000000000249820 RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However, the OS / dynamic loader only
maps the necessary part, not all the zero padding. This means that if we were
to issue a MADV_COLLAPSE, we can first do an mremap() to increase the length
of the mapping.

MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070      cycles                                        (55.53%)
   729,146,640,019      instructions      #  0.62 insn per cycle      (66.65%)
         7,062,923      itlb.itlb_flush                               (66.65%)
     1,041,825,587      iTLB-loads                                    (66.65%)
       634,272,420      iTLB-load-misses  # 60.88% of all iTLB cache accesses  (66.66%)
    27,018,254,873      itlb_misses.walk_active                       (66.68%)
       610,639,252      itlb_misses.walk_completed_4k                 (44.47%)
        24,262,549      itlb_misses.walk_completed_2m_4m              (44.46%)
             2,948      itlb_misses.walk_completed_1g                 (44.43%)

      10.039217004 seconds time elapsed

MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934      cycles                                        (55.53%)
   746,008,850,147      instructions      #  0.64 insn per cycle      (66.65%)
         7,538,962      itlb.itlb_flush                               (66.65%)
       799,861,088      iTLB-loads                                    (66.65%)
       254,347,048      iTLB-load-misses  # 31.80% of all iTLB cache accesses  (66.66%)
    14,427,296,885      itlb_misses.walk_active                       (66.69%)
       221,811,835      itlb_misses.walk_completed_4k                 (44.47%)
        32,881,405      itlb_misses.walk_completed_2m_4m              (44.46%)
             3,043      itlb_misses.walk_completed_1g                 (44.43%)

      10.038517778 seconds time elapsed

compared to a run without any huge pages (via THP or MADV_COLLAPSE):

tps = 1034960.102843 (without initial connection time)

 Performance counter stats for 'system wide':

 1,183,743,785,066      cycles                                        (55.54%)
   678,525,810,443      instructions      #  0.57 insn per cycle      (66.65%)
         7,163,304      itlb.itlb_flush                               (66.65%)
     2,952,660,798      iTLB-loads                                    (66.65%)
     2,105,431,590      iTLB-load-misses  # 71.31% of all iTLB cache accesses  (66.66%)
    80,593,535,910      itlb_misses.walk_active                       (66.68%)
     2,105,377,810      itlb_misses.walk_completed_4k                 (44.46%)
         1,254,156      itlb_misses.walk_completed_2m_4m              (44.46%)
             3,366      itlb_misses.walk_completed_1g                 (44.44%)

      10.039821650 seconds time elapsed

So a 7.96% win from no-huge-pages to MADV_COLLAPSE, and a further 2.11% win
from there to also using mremap(), yielding a total of 10.23%. It's similar
across runs.

On my system the other libraries unfortunately aren't aligned properly. It'd
be nice to also remap at least libc.

The majority of the remaining misses are from the vdso (too small for a huge
page), libc (not aligned properly), returning from system calls (which flush
the itlb) and pgbench / libpq (I didn't add the mremap there, there's not
enough code for a huge page without it).

Greetings,

Andres Freund