Umm... decoding optimization (the 4-1-1 rule)?
For example, in four_complex_cpm_fft_3:
;;1-1-1
        fld     R6                      ;; I2,I3,A2,r/i,A4,I1,R3,R1
        fmul    st(3), st               ;; B2 = I2 * r/i
;23-27
        fsubp   st(2), st               ;; A2 = A2 - I2
;24-26
;;1 (D1, D2 stall)
        fld     R8                      ;; I4,I3,A2,B2,A4,I1,R3,R1
;;2-1 (D2 stall)
        fmul    QWORD PTR [edi+24]      ;; B4 = I4 * r/i
;25-29
        fxch    st(4)                   ;; A4,I3,A2,B2,B4,I1,R3,R1
;;2-1 (D2 stall)
        fsub    R8                      ;; A4 = A4 - I4
;26-28
        fxch    st(2)                   ;; A2,I3,A4,B2,B4,I1,R3,R1

Plus, all the stores are decoded in separate cycles (each store is 2 uops).
I'm sure someone else will correct my mistakes ;)
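
To make the decode annotations above easier to check, here is a rough Python
sketch of the 4-1-1 grouping. It is a simplified model, not a description of
the real hardware: it assumes decoder D0 takes any instruction of up to 4 uops,
D1 and D2 each take only a 1-uop instruction, and it ignores fetch alignment,
instruction length, and microcoded instructions. The uop counts (1 for
register ops, fld, and fxch; 2 for an arithmetic op with a memory operand or a
store) are the ones implied by the annotations, and the function name is made
up for illustration.

```python
def decode_groups(uop_counts):
    """Group instructions into decode cycles under a simplified 4-1-1 model.

    uop_counts: list of uops per instruction, in program order.
    Returns a list of groups; each group is the uop counts decoded in one cycle.
    """
    groups = []
    i = 0
    while i < len(uop_counts):
        if uop_counts[i] > 4:
            # Assumption: microcoded instructions are outside this model.
            raise ValueError("instructions over 4 uops are not modeled")
        group = [uop_counts[i]]          # D0 accepts up to 4 uops
        i += 1
        # D1 and D2 accept only single-uop instructions.
        while len(group) < 3 and i < len(uop_counts) and uop_counts[i] == 1:
            group.append(uop_counts[i])
            i += 1
        groups.append(group)
    return groups


# The eight instructions quoted above:
# fld, fmul reg, fsubp, fld, fmul mem, fxch, fsub mem, fxch
snippet = [1, 1, 1, 1, 2, 1, 2, 1]
print(decode_groups(snippet))        # [[1, 1, 1], [1], [2, 1], [2, 1]]

# Three back-to-back 2-uop stores take three decode cycles in this model.
print(len(decode_groups([2, 2, 2])))  # 3
```

Under this model the snippet takes four decode cycles, matching the 1-1-1, 1,
2-1, 2-1 pattern in the comments. Whether decoding is actually the bottleneck
here is a separate question that only timing can answer.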

I'm sure you checked cache alignments... I can't think of anything else
offhand...

Also, I noticed that no attention was paid to K6 optimization (i.e.
tossing the fxch's) in the current code... Is there any effort to improve
that, or is it not worth it?

-----Original Message-----
From: George Woltman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 09, 1999 10:19 PM
To: [EMAIL PROTECTED]
Subject: Mersenne: Pentium Pro Optimization Help Needed


Hi all,

        I'm trying to optimize prime95 for the Pentium Pro/PII/PIII
architecture.  I'm fairly well versed in various execution units
and latencies, but some mysteries remain.

        Are there any experts in this field - maybe even some Intel
employees - that could improve the code further?  Even one clock 
cycle in a macro that will be executed a few quintillion times is
a big help.

        The new assembly macros are at ftp://entropia.com/gimps/lucas1p.mac
for you to look at.

        Questions:  Why is the code faster when I throw in some
no-ops (actually fxch st(0) instructions)?  How can I force the
CPU to execute the floating point micro-ops in the optimal order?
Does reordering the fstp instructions have any effect?  Are there
other issues I should consider?

Regards
George - who is looking forward to IA-64 where I am in control of
the opcode scheduling once again.  Not to mention lots of registers!

P.S.    The clock timings were measured using the following loop.  I can
provide more details upon request.
        mov     al, 0
        mov     ecx, 250                ; 1000 iterations
clp1:   disp four_complex_cpm_fft_3 8, 16, 32   ;;; or some other macro
        lea     esi, [esi+64]
        add     al, 256/4
        jnc     clp1
        lea     esi, [esi-256]
        dec     ecx                     ; Check loop counter
        jnz     clp1                    ; Loop if necessary
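
The "1000 iterations" comment follows from the two nested counters: al wraps
(sets carry) on every fourth add of 256/4 = 64, and ecx counts 250 such wraps.
A minimal Python sketch of just the counter arithmetic is below. It does not
model the macro itself, and the function name is hypothetical.

```python
def simulate_timing_loop(outer=250, step=256 // 4):
    """Count macro executions and net esi movement for the loop above."""
    al = 0
    ecx = outer
    esi_offset = 0      # esi relative to its starting value
    iterations = 0
    while True:
        iterations += 1          # one execution of the macro at clp1
        esi_offset += 64         # lea esi, [esi+64]
        al += step               # add al, 256/4
        carry = al > 0xFF        # carry flag from the 8-bit add
        al &= 0xFF
        if not carry:            # jnc clp1
            continue
        esi_offset -= 256        # lea esi, [esi-256]
        ecx -= 1                 # dec ecx
        if ecx == 0:             # jnz clp1 falls through
            break
    return iterations, esi_offset


print(simulate_timing_loop())    # (1000, 0)
```

So the macro runs 4 x 250 = 1000 times, and esi ends where it started, having
walked the same 256 bytes on every outer pass.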


________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm