I tested the whole thing again on my laptop (which was down for some
time). My favorite is now:
(defun test ()
  (let* ((dim 2) (n 1000)
         (entries (make-array (expt n dim) :element-type 'double-float
                              :initial-element 1.0d0)))
    (declare (optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0)))
    (dotimes (i 10)
      (loop
         for pos1 of-type fixnum from 1 below (- n 1) do
           (loop
              for pos2 of-type fixnum
              from (+ pos1 n) by n below (+ pos1 (* n (- n 1)))
              do
                (setf (aref entries pos2)
                      (* 0.1111111111
                         (+ (aref entries (+ pos2 -1001))
                            (aref entries (+ pos2 -1))
                            (aref entries (+ pos2 999))
                            (aref entries (+ pos2 -1000))
                            (aref entries (+ pos2 0))
                            (aref entries (+ pos2 1000))
                            (aref entries (+ pos2 -999))
                            (aref entries (+ pos2 1))
                            (aref entries (+ pos2 1001))))))))))
This performs on my Pentium II (400MHz) as follows:
* ;;; Evaluate (time (test))
Compiling LAMBDA NIL:
Compiling Top-Level Form:
Evaluation took:
1.48 seconds of real time
1.42 seconds of user run time
0.06 seconds of system run time
0 page faults and
8000008 bytes consed.
NIL
The C-code performs as follows:
$ time a.out
1.000000
real 0m1.135s
user 0m1.050s
sys 0m0.070s
$
Thus, CMUCL comes out roughly 30 to 35% slower than C (1.48 s vs. 1.14 s
of real time; 1.42 s vs. 1.05 s of user time). I guess the reason I see
worse numbers now is that on my rather old computer memory access is not
as large a factor as it is on newer machines; therefore the relatively
poor machine code generated by CMUCL matters more.
A final look back at what I learned:
1. Don't put too many declarations in the code at the beginning.
2. Reduce the use of your own functions/macros/types before posting
   code (a large performance problem in my initial code was that my
   double-vec constructor was not declared inline).
3. I have remembered (compilation-speed 0). Even though it does not
   change anything here, one should probably include it in the
   optimization settings, just in case.
(4. In custom-generated code, I do not have to fetch the loop limits
   from arrays.)
5. Unfortunately, for very high performance needs in this kind of
   application, Python (the CMUCL compiler) still needs to be improved.
   For example, if I did several sweeps over small data ranges, memory
   access would become less important, and the difference between CMUCL
   and C would grow much larger because of the more complex code CMUCL
   emits for the addition:
CMUCL: something like
9F0: FADDD FR1
9F2: MOV EDI, ESI
9F4: ADD EDI, 4
9F7: FSTPD FR1
9F9: FLDD [EDX+EDI*2+1]
9FD: FXCH FR1
C:
0x804844c <main+92>: faddl 0xffffe0b8(%edx)
Yours, Nicolas.