On 9 Nov 99, at 4:20, Sturle Sunde wrote:
> > perhaps conditional compilation, with each binary incorporating
> > only the routines it needs for that length, could reduce the code
> > footprint and help performance
>
> It can help if it makes it possible for the compiler to eliminate jumps
> and branches that way.
Or optimizes jumps & branches. There's a significant performance
penalty associated with jumping to an instruction which is badly
aligned with the way in which blocks of code are prefetched for
decoding.
If unexecuted code is confined to prefetch blocks (and, better still,
virtual pages) which contain no code that actually gets executed,
the associated performance penalty is very small or non-existent.
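To put that in concrete terms, here's a minimal C sketch of why fixing
the run length at compile time helps. (The routine names and the
FFT_LENGTH macro are invented purely for illustration - nothing here is
taken from the real Prime95 source.) The preprocessor throws the
untaken arms away, so they never occupy prefetch blocks or pages
anywhere near the hot loop:

#include <stdio.h>
#include <stddef.h>

#define FFT_LENGTH 1024   /* fixed at build time for the specialised binary */

/* Generic version: the run length is a run-time parameter, so every
   call pays for a chain of tests and branches, and the arms that never
   run still sit in the middle of otherwise hot code. */
void scale_pass_generic(double *x, size_t fft_length)
{
    size_t i;
    if (fft_length == 1024) {
        for (i = 0; i < 1024; i++) x[i] *= 0.5;      /* arm tuned for 1024 */
    } else if (fft_length == 2048) {
        for (i = 0; i < 2048; i++) x[i] *= 0.25;     /* arm tuned for 2048 */
    } else {
        for (i = 0; i < fft_length; i++) x[i] /= (double)fft_length;
    }
}

/* Specialised version: the dead arms are removed by the preprocessor,
   so there is nothing to mispredict and no cold bytes interleaved with
   the loop the program actually executes. */
void scale_pass_fixed(double *x)
{
    size_t i;
#if FFT_LENGTH == 1024
    for (i = 0; i < 1024; i++) x[i] *= 0.5;
#elif FFT_LENGTH == 2048
    for (i = 0; i < 2048; i++) x[i] *= 0.25;
#else
    for (i = 0; i < FFT_LENGTH; i++) x[i] /= (double)FFT_LENGTH;
#endif
}

int main(void)
{
    static double x[FFT_LENGTH] = { 1.0 };
    scale_pass_generic(x, FFT_LENGTH);
    scale_pass_fixed(x);
    printf("%f\n", x[0]);
    return 0;
}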
>
> > Or perhaps the compiler and OS already do a good job at keeping only
> > needed program segments in cache, in which case the problem lies yet
> > elsewhere.
>
> Usually.
This actually has a lot to do with the hardware. It's not the size of
the cache so much as its organization. If the L1 data cache uses e.g.
32-byte lines, like the Pentium family, and is only 4-way set
associative, then any group of addresses which happen to map to the
same set can occupy _at most_ 4 lines = 128 bytes of the cache -
however big the cache itself is. Worse, if you access data in such a
way that consecutive elements are separated by a multiple of the
critical stride (the cache size divided by the associativity), every
access maps to the same set: in effect you're using just those four
lines, and you get far more cache misses than you'd expect! The L2
cache helps you out of this particular hole, but there's still a
performance penalty. This is an example of the sort of situation that
compiler writers find quite hard to deal with.
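If you want to see the effect for yourself, something like the rough C
sketch below will do it. The 16K / 4-way / 32-byte-line geometry is
just an assumption for the sake of the demonstration (the various
Pentium models differ); the interesting number is the critical stride
= cache size / associativity, since accesses that far apart all land
in the same set.

#include <stdio.h>
#include <time.h>

#define CACHE_SIZE 16384                 /* assumed: 16K L1 data cache     */
#define ASSOC      4                     /* assumed: 4-way set associative */
#define LINE       32                    /* assumed: 32-byte lines         */
#define SET_STRIDE (CACHE_SIZE / ASSOC)  /* addresses this far apart map
                                            to the same cache set          */
#define NLINES     8                     /* more lines than one set holds  */
#define REPS       10000000

static volatile char buf[NLINES * SET_STRIDE];

/* Read NLINES lines over and over, spaced 'stride' bytes apart. */
static long walk(long stride)
{
    long sum = 0;
    long r, k;
    for (r = 0; r < REPS; r++)
        for (k = 0; k < NLINES; k++)
            sum += buf[k * stride];
    return sum;
}

int main(void)
{
    clock_t t0, t1, t2;
    long s1, s2;

    t0 = clock();
    s1 = walk(LINE);        /* 8 lines in 8 different sets: they all stay
                               resident, so the loop runs from L1         */
    t1 = clock();
    s2 = walk(SET_STRIDE);  /* 8 lines all in ONE set: only 4 ways, so
                               the lines keep evicting one another        */
    t2 = clock();

    printf("different sets: %ld ticks, same set: %ld ticks (%ld %ld)\n",
           (long)(t1 - t0), (long)(t2 - t1), s1, s2);
    return 0;
}

If the geometry guess matches your CPU, the second walk should come
out measurably slower, with L2 soaking up the misses - which is
exactly the penalty described above.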
Also there's the point that we're dealing with virtual memory
machines. The best example I can think of to illustrate this is the
old VAX problem of initializing a square array:
      DIMENSION A(1000,1000)
      DO 10 I=1,1000
      DO 10 J=1,1000
   10 A(I,J)=0.0
runs very many times faster (well over 100 times!) if you simply swap
round the DO statements (assuming that you have a maximum working set
size of 512 pages = 256KB, which was typical on early VAX
installations).
The reason is that the swapped order writes the elements in virtual
memory order, with just one demand-zero page fault when each page is
first touched and just one dirty-page write when it's finished with;
the order shown, running along the "rows" instead of down the
"columns", comes back to every page 128 times (a 512-byte VAX page
holds 128 of the 4-byte REAL elements), generating 127 extra re-read
and write-dirty page faults for each of the 7813 pages referenced.
And, with a small working set, many of those page faults result in
actual I/O to the swap/page file as well as the overhead of actually
resolving virtual addresses to physical.
If you're writing in C, you'd probably use pointers instead of array
subscripts to do this particular job; but the required order of the
loops for efficient operation would be _reversed_ compared with
Fortran.
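For what it's worth, here's a rough sketch of both C versions (the
names are mine, just for illustration). C stores the array row-major,
so it's the _last_ subscript that has to vary fastest, and the pointer
version simply walks the block in address order:

#include <stdio.h>

#define N 1000

static double a[N][N];

/* C lays the array out row-major (the LAST subscript varies fastest
   in memory), the opposite of Fortran, so here it is the j loop that
   must be innermost to walk memory sequentially. */
void init_by_subscript(void)
{
    int i, j;
    for (i = 0; i < N; i++)         /* row index outermost ...        */
        for (j = 0; j < N; j++)     /* ... last subscript innermost   */
            a[i][j] = 0.0;
}

/* The pointer version makes the intent explicit: march straight
   through the block in address order, touching each page once. */
void init_by_pointer(void)
{
    double *p = &a[0][0];
    double *end = p + (long)N * N;
    while (p < end)
        *p++ = 0.0;
}

int main(void)
{
    init_by_subscript();
    init_by_pointer();
    printf("%f\n", a[N - 1][N - 1]);
    return 0;
}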
You really do need to use profiling tools to evaluate code; you also
need detailed knowledge of the architecture you're working with in
order to get things really well optimized.
Compiler design & implementation can and does make a significant
difference, but we're a long way from being able to tell a compiler,
e.g., "Make me a program to LL test Mersenne numbers; I care about
execution speed at the expense of everything else".
> The only secure way to get this right is to use good profiling feedback
> which can record information of how often each variable and constant is
> needed, find patterns in how the program accesses what memory, where
> branches tend to go, etc, and then use this information to optimize
> register allocations, prefetching, etc in a second pass through the
> compiler.
Actually you need to repeat this step until every change you try
makes things worse rather than better.
Regards
Brian Beesley