On 9 Nov 99, at 4:20, Sturle Sunde wrote:

> > perhaps conditional compilation, with each binary incorporating
> > only the routines it needs for that length, could reduce the code
> > footprint and help performance
> 
> It can help if it makes it possible for the compiler to eliminate jumps 
> and branches that way.

Or optimizes jumps & branches. There's a significant performance 
penalty associated with jumping to an instruction which is badly 
aligned with the way in which blocks of code are prefetched for 
decoding.

If unexecuted code is confined to prefetch blocks (and, better still, 
virtual pages) separate from the code which is actually executed, the 
associated performance penalty is very small to non-existent.
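
To make the quoted suggestion concrete: a build along these lines 
leaves the routines for every other run length out of the binary 
altogether, so they can't share pages or prefetch blocks with the 
code that actually runs. The macro and function names below are 
invented purely for the sketch:

    #include <stdio.h>

    #ifndef FFT_LEN                  /* select one length per binary,   */
    #define FFT_LEN 256              /* e.g. with  cc -DFFT_LEN=512 ... */
    #endif

    #if FFT_LEN == 256
    static void transform(double *x) { (void)x; /* 256K-only code */ }
    #elif FFT_LEN == 512
    static void transform(double *x) { (void)x; /* 512K-only code */ }
    #else
    #error "no transform built for this FFT_LEN"
    #endif

    int main(void)
    {
        double x[8] = { 0.0 };

        transform(x);                /* only one variant ever exists    */
        printf("built for length tag %d\n", FFT_LEN);
        return 0;
    }
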
> 
> > Or perhaps the compiler and OS already do a good job at keeping only
> > needed program segments in cache, in which case the problem lies yet
> > elsewhere.
> 
> Usually.

This actually has a lot to do with the hardware. It's not the size of 
the cache so much as its organization. If the L1 data cache uses e.g. 
32 byte lines, like the Pentium, and is only 4-way set associative, 
then any one set can hold _at most_ 4 x 32 = 128 bytes of data - 
however big the cache itself is. Worse, if the distance between the 
elements you access is a multiple of the cache size divided by the 
associativity, every element maps to the same set, so you are in 
effect using only those four lines, and you get many times as many 
cache misses as you'd expect! The L2 cache helps you out of this 
particular hole, but there's still a performance penalty. This is an 
example of the sort of situation that compiler writers find quite 
hard to deal with.
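
Just to make that concrete, here's a minimal C sketch of the sort of 
access pattern I mean. The cache parameters are illustrative 
assumptions (a 16KB, 4-way, 32-byte-line L1 data cache), not a 
description of any particular chip:

    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed cache geometry - illustrative numbers only.             */
    #define CACHE_SIZE  16384                 /* L1 D-cache bytes       */
    #define ASSOC       4                     /* ways per set           */
    #define STRIDE      (CACHE_SIZE / ASSOC)  /* addresses this far
                                                 apart share a set      */
    #define NELEM       8                     /* > ASSOC, so they can't
                                                 all stay resident      */
    #define PASSES      10000000L

    int main(void)
    {
        char *buf = malloc((size_t)NELEM * STRIDE);
        volatile char *p = buf;   /* volatile so the loads below can't
                                     be hoisted out of the timing loop */
        long sum = 0;
        long i, pass;

        if (buf == NULL)
            return 1;

        for (i = 0; i < NELEM; i++)           /* put something in it   */
            buf[i * STRIDE] = (char)i;

        /* The eight bytes touched below are STRIDE bytes apart, so
           they all map to the same set and fight over its four lines;
           every pass misses in L1 even though only eight lines are in
           use.  Change STRIDE to STRIDE + 32 and the same loop hits
           in L1 almost every time.                                    */
        for (pass = 0; pass < PASSES; pass++)
            for (i = 0; i < NELEM; i++)
                sum += p[i * STRIDE];

        printf("%ld\n", sum);
        free(buf);
        return 0;
    }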

Also there's the point that we're dealing with virtual memory 
machines. The best example I can think of to illustrate this is the 
old VAX problem of initializing a square array:

                DIMENSION A(1000,1000)
                DO 10 I=1,1000
                DO 10 J=1,1000
10              A(I,J)=0.0

runs very many times faster (well over 100 times!) if you simply swap 
round the DO statements (assuming that you have a maximum working set 
size of 512 pages = 256KB, which was typical on early VAX 
installations).

The reason is that the swapped version writes the elements in virtual 
memory order (Fortran stores arrays column by column), so each page 
costs just one demand-zero page fault on first access and just one 
write dirty page fault when it's discarded. Written as above, the 
inner loop runs along a row instead, striding 4000 bytes between 
successive elements; each 512-byte page holds 128 of the four-byte 
elements, and the page has long since dropped out of the working set 
by the time the loop comes back for the next one, so you get 127 
extra re-read and write dirty page faults for each of the 7813 pages 
referenced. And, with a small working set, many of those page faults 
would result in actual I/O to the swap/page file as well as the 
overhead of actually resolving virtual addresses to physical.

If you're writing in C, you'd probably use pointers instead of array 
subscripts to do this particular job; but the required order of the 
loops for efficient operation would be _reversed_ compared with 
Fortran.
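
For what it's worth, a C version of the fast order might look like 
the sketch below - just an illustration, with the array on the heap 
because a million floats won't fit on most stacks. C stores a[i][j] 
row by row, so the last subscript has to be the one that varies 
fastest:

    #include <stdlib.h>

    #define N 1000

    int main(void)
    {
        /* 1000 x 1000 floats is ~4 MB, so take it from the heap.    */
        float (*a)[N] = malloc(sizeof(float[N][N]));
        int i, j;

        if (a == NULL)
            return 1;

        /* C is row-major: a[i][0], a[i][1], ... are adjacent in
           memory, so j must be the inner loop to write the array in
           address order - the mirror image of the Fortran rule.     */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 0.0f;

        free(a);
        return 0;
    }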

You really do need to use profiling tools to evaluate code; you also 
need detailed knowledge of the architecture you're working with in 
order to get things really well optimized.

Compiler design & implementation can and does make a significant 
difference, but we're a long way from being able to tell a compiler, 
e.g., "Make me a program to LL test Mersenne numbers; I care about 
execution speed at the expense of everything else".

> The only secure way to get this right is to use good profiling feedback 
> which can record information about how often each variable and constant 
> is needed, find patterns in how the program accesses what memory, where 
> branches tend to go, etc, and then use this information to optimize 
> register allocations, prefetching, etc in a second pass through the 
> compiler.

Actually you need to repeat this step until every change you try 
makes things worse rather than better.
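
For concreteness, the two-pass scheme Sturle describes maps onto 
gcc's profile-feedback options roughly as follows (a sketch only - 
the flag names are gcc's, other compilers spell it differently, and 
lltest.c is just a placeholder name):

    # pass 1: build an instrumented binary and run it on a
    # representative workload to record branch and block counts
    gcc -O2 -fprofile-generate -o lltest lltest.c
    ./lltest

    # pass 2: rebuild, letting the compiler act on the recorded profile
    gcc -O2 -fprofile-use -o lltest lltest.c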


Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
