Raymond Toy <[EMAIL PROTECTED]> writes:

> >>>>> "Gerd" == Gerd Moellmann <[EMAIL PROTECTED]> writes:
> 
>     Gerd> Christophe Rhodes <[EMAIL PROTECTED]> writes:
> 
>     >> * A smarter register allocator that chooses registers for the
>     >> innermost, not the outermost, loops;
> 
>     Gerd> ... and that does better floating-point register allocation on x86
> 
> Isn't this because you really can't "allocate" registers because the
> FPU is a stack?  

I'm not sure what is causing this.

> I also thought this didn't hurt much because the Pentium does
> register renaming on the FPU so those fxch instructions are very
> fast?

In the thread on cmucl-help starting with

  From: Nicolas Neuss <[EMAIL PROTECTED]>
  Subject: Performance problem
  Newsgroups: gmane.lisp.cmucl.general
  Date: 14 Sep 2002 15:13:24 +0200

I posted some code which modified a sequence of 

 (setf (aref entries pos2)
     (* 0.1111111111
        (+ (aref entries (the uint (+ pos2 -101)))  ; !! -1001
           (aref entries (the uint (+ pos2 -1)))
           (aref entries (the uint (+ pos2 99)))    ; !!   999
           (aref entries (the uint (+ pos2 -100)))  ; !! -1000
           (aref entries (the uint (+ pos2 0)))
           (aref entries (the uint (+ pos2 100)))   ; !!  1000
           (aref entries (the uint (+ pos2 -99)))   ; !!  -999
           (aref entries (the uint (+ pos2 1)))
           (aref entries (the uint (+ pos2 101))))))  ; !! 1001

in an inner loop, IIRC, to

  (let ((a0 (aref entries (the uint (- pos2 1001))))
        (a1 (aref entries (the uint (- pos2 1000))))
        (a2 (aref entries (the uint (- pos2 999))))
        (a3 (aref entries (the uint (+ pos2 999))))
        (a4 (aref entries (the uint (+ pos2 1000))))
        (a5 (aref entries (the uint (+ pos2 1001))))
        (a6 (aref entries (the uint (- pos2 1))))
        (a7 (aref entries (the uint (- pos2 0))))
        (a8 (aref entries (the uint (+ pos2 1)))))
     (setf (aref entries pos2)
           (* 0.1111111111 (+ a0 a1 a2 a3 a4 a5 a6 a7 a8))))

which resulted in much better assembler code, basically because the
temporaries A_n are arranged in a stack-like manner.  The run times
were something like 1.22s for my version and 1.49s for the second
best, which did not use this trick.
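
For anyone who wants to try the rewrite outside the original loop, here
is a minimal, self-contained sketch of the same idea.  The names
AVERAGE-9 and STRIDE, the double-float array type, and the division by
9d0 (instead of multiplying by 0.1111111111) are assumptions made for
this illustration and are not taken from the code in that thread.
Whether the LET form actually produces better code depends on the
compiler and platform.

```lisp
;; Sketch only: AVERAGE-9 and STRIDE are hypothetical names.
;; ENTRIES is assumed to be a one-dimensional double-float array
;; holding a grid stored row by row, STRIDE elements per row.
(defun average-9 (entries pos stride)
  (declare (type (simple-array double-float (*)) entries)
           (type fixnum pos stride))
  ;; Load all nine neighbours into temporaries first, then sum them,
  ;; instead of nesting the AREF forms inside the addition.
  (let* ((up   (- pos stride))          ; same column, previous row
         (down (+ pos stride))          ; same column, next row
         (a0 (aref entries (- up 1)))
         (a1 (aref entries up))
         (a2 (aref entries (+ up 1)))
         (a3 (aref entries (- pos 1)))
         (a4 (aref entries pos))
         (a5 (aref entries (+ pos 1)))
         (a6 (aref entries (- down 1)))
         (a7 (aref entries down))
         (a8 (aref entries (+ down 1))))
    ;; Store the mean of the 3x3 neighbourhood back at POS.
    (setf (aref entries pos)
          (/ (+ a0 a1 a2 a3 a4 a5 a6 a7 a8) 9d0))))
```

On a 3x3 grid holding 1d0 through 9d0, (average-9 grid 4 3) replaces
the centre element with 5.0d0.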

Some code in GCC must be doing something like this automatically, I
guess.
