On 01/21/2013 04:33 PM, Berthold Bäuml wrote:


I just did that. Here are the types:

  real-matrix* : (Array Real) (Array Real) -> (Array Real)

  flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)

  flmatrix* : FlArray FlArray -> FlArray

Results so far, measured in DrRacket with debugging off:

Function           Size              Time
-----------------------------------------
matrix*            100x100          340ms
real-matrix*       100x100           40ms
flonum-matrix*     100x100           10ms
flmatrix*          100x100            6ms

matrix*           1000x1000      418000ms
real-matrix*      1000x1000       76000ms
flonum-matrix*    1000x1000        7000ms
flmatrix*         1000x1000        4900ms

The only difference between `real-matrix*' and `flonum-matrix*' is that the 
former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can 
inline `real-matrix*', TR's optimizer will change the former to the latter, 
making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup 
TR's optimizer will have ever shown me.)

It looks like the biggest speedup comes from doing only flonum ops in the inner 
loop sum, which keeps all the intermediate flonums unboxed (i.e. not 
heap-allocated or later garbage-collected).

Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up 
further. I won't yet, because I haven't settled on a type for matrices of 
unboxed flonums. The type has to work with LAPACK if it's installed, which 
`FlArray' doesn't do because its data layout is row-major and LAPACK expects 
column-major.

I'll change `matrix*' to look like `real-matrix*'. It won't give the very best 
performance, but it's a 60x speedup for 1000x1000 matrices.


These results look very promising, esp. if , as you mentioned, in the end the 
real-matrix* will automatically reach the flonum-matrix* performance for 
Flonums and the flmatrix* automatically switches to a LAPACK based variant, 
when available. For the latter it would be great if one could even change the 
used library to, e.g., redirect to a installation of the highly efficient MKL 
library from Intel.

Looking forward to benchmark it against Numpy and Mathematica (which is MKL 
based) again!

The next release candidate will have a `matrix*' that runs as fast as `flonum-matrix*' on (Matrix Flonum) arguments, and as fast as `real-matrix*' on (Matrix Real) arguments.

(Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit stupidly" implementation was actually the fastest I could do, which probably means it's as fast as it can be done in Racket without dropping to C.)

Additionally, all the matrix functions will have more precise types to make it easy for TR to prove that operations on (Matrix Flonum) values yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real) and (Matrix Number).) Example:

  > (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
  (Values (Array Flonum) (Array Flonum))
  > (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
                                     [3.0+0.0i 4.0+0.0i]])))
  (Values (Array Float-Complex) (Array Float-Complex))

The current release candidate returns two (Array Real) and (Array Number) instead.

Since I've brought up decompositions, you should know that these, `matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation except multiplication will be slow on large matrices. I can't think of a way to inline them in a way that TR can optimize. For fast versions, which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/

Is the MKL library binary-compatible with LAPACK?

Neil ⊥

____________________
 Racket Users list:
 http://lists.racket-lang.org/users

Reply via email to