On Friday 14 November 2003 23:19, Danilo Piazzalunga wrote:
> At 00:13 on Friday 14 November 2003, William Kenworthy wrote:
> > you might also look at -falign-functions=8/16/32 as well. On an
> > Athlon TBird 1.4, 4 had zero gain, 8 and 16 were slower than 4, but
> > 32 was consistently a little better. Possibly because it's a 32-bit
> > system.
>
> Align functions to as much as 32-byte boundaries? Please show your
> results, I'm a bit curious.
And 64 might be even better in some cases, because AMD has 64-byte cache
lines. If you always align to the cache line boundary you may consume
more memory, and possibly more cache, but on a cache miss you have a
better chance of fetching the function in fewer cache lines.
As an example (the cache line here is 16 bytes, because I'm lazy :)),
these are the possible placements of a 16-byte function
when -falign-functions=x:
a: x=4
b: x=8
c: x=16
Byte offsets from 0 to 31
0000000000111111 1111222222222233
0123456789012345 6789012345678901
aaaaaaaaaaaaaaaa
    aaaaaaaaaaaa aaaa
        aaaaaaaa aaaaaaaa
            aaaa aaaaaaaaaaaa
                 aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
        bbbbbbbb bbbbbbbb
                 bbbbbbbbbbbbbbbb
cccccccccccccccc
                 cccccccccccccccc
As these cases show, the only alignment that always uses the optimal
number of cache lines is the one that matches the cache line boundary.
So why wouldn't we align all code to this 64-byte boundary? The answer
is the amount of wasted space.
Say I have three functions that fit into 16 cache lines when aligned to
4-byte boundaries, but can consume up to 18 cache lines when aligned to
the cache line size. If I need them in a tight loop, I consume two cache
lines too many.
I would say it could be quite useful to try this on AMDs, because they
have quite a large amount of cache. On an Intel CPU it is simply a waste
of memory, because current Intel CPUs do not cache the original code:
they first break the code down into micro-ops and then cache those.
Function alignment does not matter much on Intel CPUs, because the code
is almost always prefetched long before its micro-ops reach the end of
the pipeline. This is due to the excellent branch prediction of Intel
CPUs.
--
[EMAIL PROTECTED] mailing list