Hi all,
I'm trying to optimize prime95 for the Pentium Pro/PII/PIII
architecture. I'm fairly well versed in various execution units
and latencies, but some mysteries remain.
Are there any experts in this field - maybe even some Intel
employees - who could improve the code further? Even one clock
cycle in a macro that will be executed a few quintillion times is
a big help.
The new assembly macros are at ftp://entropia.com/gimps/lucas1p.mac
for you to look at.
Questions: Why is the code faster when I throw in some
no-ops (actually fxch st(0) instructions)? How can I force the
CPU to execute the floating point micro-ops in the optimal order?
Does reordering the fstp instructions have any effect? Are there
other issues I should consider?
Regards,
George - who is looking forward to IA-64 where I am in control of
the opcode scheduling once again. Not to mention lots of registers!
P.S. The clock timings were measured using the following loop. I can
provide more details upon request.
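        mov     al, 0                   ; inner counter: wraps to 0 after 4 adds of 64
        mov     ecx, 250                ; 250 outer passes x 4 inner = 1000 iterations
clp1:   disp    four_complex_cpm_fft_3 8, 16, 32 ;;; or some other macro
        lea     esi, [esi+64]           ; advance esi by 64 bytes (lea leaves flags alone)
        add     al, 256/4               ; carry is set on every 4th pass
        jnc     clp1
        lea     esi, [esi-256]          ; rewind esi by the 4 x 64 bytes just covered
        dec     ecx                     ; Check loop counter
        jnz     clp1                    ; Loop if necessary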
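One possible way to take the clock count around this loop is sketched
below. It assumes the time-stamp counter is read with rdtsc; this is an
illustration, not necessarily the exact harness used. start_lo and
start_hi are hypothetical DWORD variables, not names from lucas1p.mac,
and an assembler that lacks the rdtsc mnemonic would need the opcode
bytes (db 0fh, 31h) instead.

        rdtsc                           ; EDX:EAX = time-stamp counter (clobbers both)
        mov     start_lo, eax           ; start_lo, start_hi: hypothetical DWORDs
        mov     start_hi, edx
        ;; ... the timing loop goes here (it sets al and ecx itself) ...
        rdtsc                           ; read the counter again after the loop
        sub     eax, start_lo           ; 64-bit subtract: low half first,
        sbb     edx, start_hi           ; then high half with borrow
                                        ; EDX:EAX = elapsed clocks for 1000 iterations

Dividing the result by 1000 gives the average clocks per macro
invocation, loop overhead included. rdtsc is not a serializing
instruction, so a few clocks of slop at either end are possible, but
that should be small next to 1000 iterations.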
________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm