On Friday 14 November 2003 23:19, Danilo Piazzalunga wrote:
> At 00:13 on Friday 14 November 2003, William Kenworthy wrote:
> > you might also look at -falign-functions=8/16/32 as well. On an
> > Athlon TBird 1.4, 4 had zero gain, 8 and 16 were slower than 4, but
> > 32 was consistently a little better. Possibly because it's a 32-bit
> > system.
>
> Align functions to as much as 32-byte boundaries? Please show your
> results, I'm a bit curious.
And 64 might be even better in some cases, because AMD has 64-byte cache
lines. If you always align to the cache line boundary you may consume
more memory, and possibly more cache, but on a cache miss you have a
better chance of fetching the function in fewer cache lines.
As an example (the cache line here is 16 bytes, because I'm lazy :)),
these are the possible placements of a 16-byte function
when -falign-functions=x:
a: x=4
b: x=8
c: x=16
Byte offsets from 0 to 31
0000000000111111 1111222222222233
0123456789012345 6789012345678901
aaaaaaaaaaaaaaaa
    aaaaaaaaaaaa aaaa
        aaaaaaaa aaaaaaaa
            aaaa aaaaaaaaaaaa
                 aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
        bbbbbbbb bbbbbbbb
                 bbbbbbbbbbbbbbbb
cccccccccccccccc
                 cccccccccccccccc
As these cases show, the only alignment that always uses the optimal
number of cache lines is the one that matches the cache line boundary.
So why wouldn't we align all code to this 64-byte boundary? The answer
is the amount of wasted space.
Say I have three functions that fit into 16 cache lines when aligned to
4-byte boundaries, but can consume up to 18 cache lines when aligned to
the cache line size. If I need them in a tight loop, I consume two cache
lines too many.
I would say it could be quite useful to try this on AMDs, because they
have quite a large amount of cache. On an Intel CPU it is simply a waste
of memory, because current Intel CPUs do not cache the original code:
they first break the code down into micro-ops and then cache those.
Function alignment does not matter much on Intel CPUs, because the code
is almost always prefetched long before its micro-ops reach the end of
the pipeline. This is due to the excellent branch prediction of Intel
CPUs.
--
[EMAIL PROTECTED] mailing list