Hi all,

        I'm trying to optimize prime95 for the Pentium Pro/PII/PIII
architecture.  I'm fairly well versed in various execution units
and latencies, but some mysteries remain.

        Are there any experts in this field - maybe even some Intel
employees - who could help improve the code further?  Saving even one
clock cycle in a macro that will be executed a few quintillion times
is a big help.

        The new assembly macros are at ftp://entropia.com/gimps/lucas1p.mac
for you to look at.

        Questions:  Why is the code faster when I throw in some
no-ops (actually fxch st(0) instructions)?  How can I force the
CPU to execute the floating point micro-ops in the optimal order?
Does reordering the fstp instructions have any effect?  Are there
other issues I should consider?

Regards
George - who is looking forward to IA-64 where I am in control of
the opcode scheduling once again.  Not to mention lots of registers!

P.S.    The clock timings were measured using the following loop.  I can
provide more details upon request.
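        mov     al, 0                   ; Inner counter, carries out every 4th add
        mov     ecx, 250                ; 250 outer passes * 4 inner = 1000 iterations
clp1:   disp four_complex_cpm_fft_3 8, 16, 32           ;;; or some other macro
        lea     esi, [esi+64]           ; Advance 64 bytes (lea leaves flags alone)
        add     al, 256/4               ; Carry is set on every 4th pass
        jnc     clp1                    ; Inner loop until carry
        lea     esi, [esi-256]          ; Rewind esi to where it started
        dec     ecx                     ; Check loop counter
        jnz     clp1                    ; Loop if necessary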


________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
