I tested the whole thing again on my laptop (which was down for some
time).  My favorite is now:

(defun test ()
  (let* ((dim 2) (n 1000)
         (entries (make-array (expt n dim) :element-type 'double-float
                              :initial-element 1.0d0)))
    (declare (optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0)))
    (dotimes (i 10)
      (loop
       for pos1 of-type fixnum from 1 below (- n 1) do
       (loop
        for pos2 of-type fixnum
        from (+ pos1 n) by n below (+ pos1 (* n (- n 1)))
        do
        ;; nine-point stencil average; the d0 suffix keeps the
        ;; coefficient a double-float, avoiding single-float contagion
        (setf (aref entries pos2)
              (* 0.1111111111d0
                 (+ (aref entries (+ pos2 -1001))
                    (aref entries (+ pos2 -1))
                    (aref entries (+ pos2 999))
                    (aref entries (+ pos2 -1000))
                    (aref entries (+ pos2 0))
                    (aref entries (+ pos2 1000))
                    (aref entries (+ pos2 -999))
                    (aref entries (+ pos2 1))
                    (aref entries (+ pos2 1001)))))))))

This performs on my Pentium II (400MHz) as follows:

* ;;; Evaluate (time (test))
Compiling LAMBDA NIL: 
Compiling Top-Level Form: 
 
Evaluation took:
  1.48 seconds of real time
  1.42 seconds of user run time
  0.06 seconds of system run time
  0 page faults and
  8000008 bytes consed.
NIL

The C-code performs as follows:

$ time a.out
1.000000
real    0m1.135s
user    0m1.050s
sys     0m0.070s
$ 

Thus, CMUCL is about 30% slower than C (1.48 s vs. 1.135 s real
time).  I guess the reason I see worse numbers now is that on my
rather old computer memory access is not as large a factor as on newer
machines.  Therefore the relatively bad machine code of CMUCL matters
more.

A final look back to what I learned:

1. Don't start out by putting too many declarations in the code.

2. Reduce the use of your own functions/macros/types before posting
   the code (a large performance problem in my initial code was that
   my double-vec constructor was not declared inline).

3. I have remembered (compilation-speed 0).  Even though it does not
   change anything here, one should probably include it in the
   optimization list, just in case.

(4. In custom-generated code, I do not have to fetch the limits out of
   the arrays.)

5. Unfortunately, for very high performance needs in this kind of
   application, Python (the CMUCL compiler) still needs to be
   improved.  For example, if I did several sweeps over small data
   ranges, memory access would become less important, and the
   difference between CMUCL and C would become much larger, due to
   CMUCL's more complex code for performing the addition:

   CMUCL: something like
     9F0:       FADDD FR1
     9F2:       MOV   EDI, ESI
     9F4:       ADD   EDI, 4
     9F7:       FSTPD FR1
     9F9:       FLDD  [EDX+EDI*2+1]
     9FD:       FXCH  FR1

   C:
     0x804844c <main+92>:       faddl  0xffffe0b8(%edx)


Yours, Nicolas.

