* Denys Vlasenko <[email protected]> wrote:

> > The AMD system, with a starkly different x86 microarchitecture, is 
> > showing similar characteristics:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses        ( +-  0.10% )  (100.00%)
> > linux-falign-functions=_32-bytes/res-amd.txt:        110,433,214      L1-icache-load-misses        ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses        ( +-  0.17% )  (100.00%)
> > linux-falign-functions=128-bytes/res-amd.txt:        119,100,216      L1-icache-load-misses        ( +-  0.22% )  (100.00%)
> > linux-falign-functions=_16-bytes/res-amd.txt:        122,916,937      L1-icache-load-misses        ( +-  0.15% )  (100.00%)
> > linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses        ( +-  0.18% )  (100.00%)
> > linux-falign-functions=__2-bytes/res-amd.txt:        124,337,908      L1-icache-load-misses        ( +-  0.71% )  (100.00%)
> > linux-falign-functions=__4-bytes/res-amd.txt:        125,221,805      L1-icache-load-misses        ( +-  0.09% )  (100.00%)
> > linux-falign-functions=256-bytes/res-amd.txt:        135,761,433      L1-icache-load-misses        ( +-  0.18% )  (100.00%)
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        159,918,181      L1-icache-load-misses        ( +-  0.10% )  (100.00%)
> > linux-falign-functions=512-bytes/res-amd.txt:        170,307,064      L1-icache-load-misses        ( +-  0.26% )  (100.00%)
> > 
> > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes 
> > is much steeper than on Intel systems: cache associativity is likely 
> > lower on this AMD CPU.
> > 
> > Interestingly, the 1-byte alignment result is still pretty good on AMD 
> > systems - and I used the exact same kernel image on both systems, so 
> > the layout of the functions is exactly the same.
> > 
> > Elapsed time is noisier, but shows a similar trend:
> > 
> > linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed        ( +-  2.74% )
> > linux-falign-functions=128-bytes/res-amd.txt:        1.932961745 seconds time elapsed        ( +-  2.18% )
> > linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed        ( +-  1.84% )
> > linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed        ( +-  2.15% )
> > linux-falign-functions=_32-bytes/res-amd.txt:        1.962074787 seconds time elapsed        ( +-  2.38% )
> > linux-falign-functions=_16-bytes/res-amd.txt:        2.000941789 seconds time elapsed        ( +-  1.18% )
> > linux-falign-functions=__4-bytes/res-amd.txt:        2.002305627 seconds time elapsed        ( +-  2.75% )
> > linux-falign-functions=256-bytes/res-amd.txt:        2.003218532 seconds time elapsed        ( +-  3.16% )
> > linux-falign-functions=__2-bytes/res-amd.txt:        2.031252839 seconds time elapsed        ( +-  1.77% )
> > linux-falign-functions=512-bytes/res-amd.txt:        2.080632439 seconds time elapsed        ( +-  1.06% )
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        2.346644318 seconds time elapsed        ( +-  2.19% )
> > 
> > 64-byte alignment is the sweet spot here as well; it's 3.7% 
> > faster than the default 16-byte alignment.
> 
> On AMD, 64 bytes wins too, yes, but by a *very* small margin. The 
> 8-byte and 1-byte alignments have basically the same timings, and 
> both take, what, +0.63% longer to run?

See my previous mail for why this is a misleading conclusion: the 
timings have so much stddev that they are not really comparable. But if 
you look at the cache-miss numbers you'll see the true difference:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses        ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses        ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses        ( +-  0.17% )  (100.00%)

1-byte alignment isn't optimal on AMD either.

> > +        #
> > +        # Allocate a separate cacheline for every function,
> > +        # for optimal instruction cache packing:
> > +        #
> > +        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
> 
> How about  -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1  instead?
> 
> This avoids pathological cases where a function starting just a few 
> bytes after a 64-byte boundary gets aligned to the next one, wasting 
> ~60 bytes.

Well, that should be pretty similar to the 32-byte aligned numbers I 
already posted, right?

The 32-byte numbers are better than the default 16-byte alignment, but 
64-byte alignment was even better.
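For reference, GCC treats a non-power-of-two -falign-functions=n as
"align to the next power of two >= n, but insert at most n-1 padding
bytes" - which is what makes the N/2+1 trick work. A small sketch of
that rule (the addresses are made up for illustration):

```python
# A sketch of how GCC interprets a non-power-of-two -falign-functions=n:
# align to the next power of two >= n, but insert at most n-1 padding
# bytes.  So n=33 means "align to 64, but only when that costs at most
# 32 bytes of padding".

def align_function(addr, n=33):
    """Return the start address the rule would pick for a function at 'addr'."""
    boundary = 1
    while boundary < n:              # next power of two >= n
        boundary *= 2
    pad = -addr % boundary           # bytes needed to reach the boundary
    if pad <= n - 1:                 # skip at most n-1 bytes
        return addr + pad
    return addr                      # too expensive: leave unaligned

# A function starting 8 bytes past a 64-byte boundary would waste 56
# bytes under plain 64-byte alignment; with n=33 it stays put:
print(hex(align_function(0x1008)))   # 0x1008 (56 > 32: no padding)
print(hex(align_function(0x1030)))   # 0x1040 (16 <= 32: aligned to 64)
```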

A quick build shows:

   text    data     bss     dec     hex filename
13534242        2579560 1634304 17748106        10ed08a linux-falign-functions=__1-bytes/vmlinux
13554530        2579560 1634304 17768394        10f1fca linux-falign-functions=__2-bytes/vmlinux
13590946        2579560 1634304 17804810        10fae0a linux-falign-functions=__4-bytes/vmlinux
13658786        2579560 1634304 17872650        110b70a linux-falign-functions=__8-bytes/vmlinux
13799602        2579560 1634304 18013466        112dd1a linux-falign-functions=_16-bytes/vmlinux
13943906        2579560 1634304 18157770        11510ca linux-falign-functions=_33-bytes/vmlinux [1]
14075330        2579560 1634304 18289194        117122a linux-falign-functions=_32-bytes/vmlinux [2]
14664898        2579560 1634304 18878762        120112a linux-falign-functions=_64-bytes/vmlinux
15980994        2579560 1634304 20194858        134262a linux-falign-functions=128-bytes/vmlinux
19038018        2591848 1634304 23264170        162fbaa linux-falign-functions=256-bytes/vmlinux
26391106        2591848 1634304 30617258        1d32eaa linux-falign-functions=512-bytes/vmlinux

Interestingly, the -falign-functions=33 kernel is marginally smaller 
than the -falign-functions=32 kernel - by about 0.9% of text size.

But the numbers aren't exceptional:

       671,702,272      L1-icache-load-misses        ( +-  0.06% )  (100.00%)
    11,892,913,320      instructions                 ( +-  0.01% )
           300,030      context-switches             ( +-  0.00% )

       7.349312237 seconds time elapsed              ( +-  1.20% )

Compared to the others it ranks third:

linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses        ( +-  0.07% )  (100.00%)
linux-falign-functions=128-bytes/res.txt:        669,401,612      L1-icache-load-misses        ( +-  0.08% )  (100.00%)
linux-falign-functions=_33-bytes/res.txt: [NEW]  671,702,272      L1-icache-load-misses        ( +-  0.06% )  (100.00%)
linux-falign-functions=_32-bytes/res.txt:        685,969,043      L1-icache-load-misses        ( +-  0.08% )  (100.00%)
linux-falign-functions=256-bytes/res.txt:        699,130,207      L1-icache-load-misses        ( +-  0.06% )  (100.00%)
linux-falign-functions=512-bytes/res.txt:        699,130,207      L1-icache-load-misses        ( +-  0.06% )  (100.00%)
linux-falign-functions=_16-bytes/res.txt:        706,080,917      L1-icache-load-misses  [vanilla kernel]  ( +-  0.05% )  (100.00%)
linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses        ( +-  0.31% )  (100.00%)
linux-falign-functions=__4-bytes/res.txt:        725,707,848      L1-icache-load-misses        ( +-  0.12% )  (100.00%)
linux-falign-functions=__8-bytes/res.txt:        726,543,194      L1-icache-load-misses        ( +-  0.04% )  (100.00%)
linux-falign-functions=__2-bytes/res.txt:        738,946,179      L1-icache-load-misses        ( +-  0.12% )  (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        921,910,808      L1-icache-load-misses        ( +-  0.05% )  (100.00%)

So I still think the best strategy would be a variant of what I 
mentioned before:

  1) 'small' functions: those 1-63 bytes in size.

  2) 'large' functions: those 64 bytes or larger.

  3) align all large functions to 64-byte cachelines.

  4) pack all small functions into the remaining large-function tail 
     holes, without letting them spill over into the next cacheline.

  5) pack the remaining small functions with each other, tightly, as 
     long as they don't spill over to the next cacheline.

  6) once optimization steps 1-5 are performed, for all cachelines 
     that contain small functions and still have a hole near the end 
     of the cacheline, allow functions to be aligned to 32/16/8/4/2 
     bytes as long as they still fit into the cacheline.
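The steps above could be sketched roughly like this - a hypothetical
greedy first-fit illustration, not a proposed implementation; the
helper names, the 64-byte LINE constant and the (name, size) input
format are all made up for illustration, and step 6 is omitted:

```python
LINE = 64  # assumed cacheline size

def pack(functions):
    """Lay out (name, size) pairs per steps 1-5: large functions on
    line boundaries, small functions first-fit into the tail holes."""
    large = [f for f in functions if f[1] >= LINE]
    small = [f for f in functions if f[1] < LINE]
    layout = []          # (name, start) pairs
    holes = []           # (start, length) tail holes behind large fns
    addr = 0

    # Steps 2-3: align every large function to a cacheline boundary,
    # recording the unused tail of its last line as a hole.
    for name, size in large:
        addr += -addr % LINE                # round up to next line
        layout.append((name, addr))
        addr += size
        tail = -addr % LINE
        if tail:
            holes.append((addr, tail))
            addr += tail

    # Step 4: first-fit small functions into the tail holes.
    rest = []
    for name, size in small:
        for i, (hstart, hlen) in enumerate(holes):
            if size <= hlen:
                layout.append((name, hstart))
                holes[i] = (hstart + size, hlen - size)
                break
        else:
            rest.append((name, size))

    # Step 5: pack leftover small functions tightly, starting a fresh
    # line whenever one would otherwise straddle a boundary.
    for name, size in rest:
        if addr // LINE != (addr + size - 1) // LINE:
            addr += -addr % LINE
        layout.append((name, addr))
        addr += size
    return sorted(layout, key=lambda x: x[1])

# Example mirroring the diagram below: fn1/fn2 land in fn3's tail hole,
# fn4 starts on its own line instead of straddling one.
print(pack([("large-fn3", 100), ("small-fn1", 10),
            ("small-fn2", 12), ("small-fn4", 40)]))
```

First-fit is just one possible heuristic here; it sidesteps the full
ordering search, which is what makes the real problem hard.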

So for example let's consider this layout with tight packing:
  
  [small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Both 'fn3' and 'fn4' straddle two cachelines.

After step 5 we'd get such a packing:

  [small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

fn1, fn2 and fn4 each now fit within a single cacheline, while 
large-fn3 necessarily has to spill into the next cacheline due to its 
size. 'fn4' in particular no longer straddles a boundary.

After step 6 we'd get some extra intra-cacheline alignment by 
utilizing the holes:

  [small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle]
  [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Note how small-fn2 got moved forward by 2 bytes to put it on an 8-byte 
boundary, and small-fn4 got moved forward by 3 bytes - at the expense 
of the holes at the end of the cachelines.

This function alignment optimization minimizes the number of cacheline 
boundaries intersecting individual functions, which is what aggressively 
minimizes the I$ footprint.
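The quantity being minimized can be written as a tiny helper 
(hypothetical, assuming a 64-byte line):

```python
LINE = 64  # assumed cacheline size

def crossings(start, size):
    """Number of cacheline boundaries a function at [start, start+size) straddles."""
    return (start + size - 1) // LINE - start // LINE

# A 40-byte function fully inside one line costs nothing; the same
# function starting 8 bytes before a boundary pays one extra line:
print(crossings(0, 40))    # 0
print(crossings(56, 40))   # 1
```

Summing this over all functions gives the layout's extra I$ cost; a 
100-byte function unavoidably contributes at least 1.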

This is not a trivial computation, btw: even if we do this only per .o 
object file, we'd have to consider nr_fns! permutations of all possible 
function orderings. We'd want to pick the ordering that matches the 
above 1-6 constraints _and_ is as close to 'source code ordering' as 
possible.

Such an optimization cannot be expressed via simple local alignment 
rules like any variant of -falign-functions, as it requires reordering 
functions.

Thanks,

        Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
