On 01/20/2013 11:14 AM, Jens Axel Søgaard wrote:
> 2013/1/20 Berthold Bäuml <berthold.bae...@dlr.de>:
>
>> With the ffi-binding example from Jens (thank you!) I get 450ms for the
>> 1000x1000 multiply -- only 2x slower than Mathematica or Numpy. So
>> integrating such a binding in math/matrix looks promising.
>
> Huh? I am surprised it is twice as slow.
>
> Ah! The Numpy example uses floats and not doubles.
>
> It may be FFI overhead as well.
>
>> Nevertheless, my original motivation for the little test was to get an
>> impression of what performance could be achieved in purely Typed Racket
>> for numerical algorithms. Would it in principle be possible to come
>> close to pure C-code performance (using the same algorithm) when using
>> a floating-point implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.

I just did that. Here are the types:

  real-matrix* : (Array Real) (Array Real) -> (Array Real)

  flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)

  flmatrix* : FlArray FlArray -> FlArray

Results so far, measured in DrRacket with debugging off:

Function           Size              Time
-----------------------------------------
matrix*            100x100          340ms
real-matrix*       100x100           40ms
flonum-matrix*     100x100           10ms
flmatrix*          100x100            6ms

matrix*           1000x1000      418000ms
real-matrix*      1000x1000       76000ms
flonum-matrix*    1000x1000        7000ms
flmatrix*         1000x1000        4900ms

The only difference between `real-matrix*' and `flonum-matrix*' is that the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can inline `real-matrix*', TR's optimizer will change the former to the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup TR's optimizer will have ever shown me.)

It looks like the biggest speedup comes from doing only flonum ops in the inner loop sum, which keeps all the intermediate flonums unboxed (i.e. not heap-allocated or later garbage-collected).
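To make the inner-loop point concrete, here's a minimal sketch of that kind of multiply over flat, row-major FlVectors (the function name and flat-vector representation are mine for illustration, not the math library's). Because every operation in the fold is `fl+'/`fl*' on Flonum-typed values, TR can keep the running sum unboxed:

```racket
#lang typed/racket
(require racket/flonum)

;; Illustrative sketch: multiply two n×n matrices stored as flat,
;; row-major FlVectors. All inner-loop arithmetic is fl+/fl* on
;; Flonums, so intermediate sums stay unboxed.
(: fl-mat-mul (FlVector FlVector Natural -> FlVector))
(define (fl-mat-mul A B n)
  (define C (make-flvector (* n n) 0.0))
  (for* ([i (in-range n)]
         [j (in-range n)])
    ;; Dot product of row i of A with column j of B.
    (define s
      (for/fold ([acc : Flonum 0.0]) ([k (in-range n)])
        (fl+ acc (fl* (flvector-ref A (+ (* i n) k))
                      (flvector-ref B (+ (* k j) j))))))
    (flvector-set! C (+ (* i n) j) s))
  C)
```

(Note the B index: element (k, j) of an n-wide row-major matrix lives at k*n + j.) Swap `fl+'/`fl*' for `+'/`*' and the sums become boxed Real arithmetic -- which is the whole `real-matrix*' vs `flonum-matrix*' gap.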

Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up further. I won't yet, because I haven't settled on a type for matrices of unboxed flonums. The type has to work with LAPACK if it's installed, which `FlArray' doesn't do because its data layout is row-major and LAPACK expects column-major.
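For the layout issue: converting between the two orders is just an index permutation. A sketch, with a made-up helper name (row-major element (i, j) of an m×n matrix lives at i*n + j; column-major, as LAPACK expects, puts it at j*m + i):

```racket
#lang typed/racket
(require racket/flonum)

;; Hypothetical helper: copy a row-major m×n matrix into the
;; column-major layout LAPACK expects.
(: row-major->col-major (FlVector Natural Natural -> FlVector))
(define (row-major->col-major A m n)
  (define B (make-flvector (* m n) 0.0))
  (for* ([i (in-range m)]
         [j (in-range n)])
    ;; row-major index: i*n + j ; column-major index: j*m + i
    (flvector-set! B (+ (* j m) i)
                   (flvector-ref A (+ (* i n) j))))
  B)
```

Of course, copying defeats the point of sharing the data with LAPACK, which is why the choice of layout for the unboxed-matrix type matters in the first place.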

I'll change `matrix*' to look like `real-matrix*'. It won't give the very best performance, but it's a 60x speedup for 1000x1000 matrices.

Neil ⊥

____________________
 Racket Users list:
 http://lists.racket-lang.org/users
