Hi Morten,

On 04/09/2013, at 9:53 PM, Morten Olsen Lysgaard <mor...@lysgaard.no> wrote:

> I've been trying to get some speed out of the accelerate library today.
> What I want to implement is something as simple as a matrix multiply.
> I'd like it to be fast and memory efficient.

Well, the trouble with something like matrix multiply is that it is only 
conceptually simple. Making it fast and memory [bandwidth] efficient is another 
matter entirely, GPU or otherwise.

> Anyone know how to achieve this with accelerate? My first thought was
> to use the generate function to create the new C array, but I didn't
> manage to wrap my head around all the fancy type features that pop up
> when you want to return an array C that has dimensions dependent on
> the dimensions of its inputs, A and B.

If in doubt, add a type signature; sometimes you need to help GHC's type 
checker along a bit. Typically this comes up when constructing/deconstructing 
things using lift/unlift respectively, which you'll need to do to unpack shapes 
& indices.
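
For example, unpacking and repacking a two-dimensional index looks like this 
(a minimal sketch; the name swap2D is just for illustration, and the 
interpreter backend stands in for whichever backend you use):

```haskell
{-# LANGUAGE TypeOperators #-}

import Data.Array.Accelerate
  ( DIM2, Exp, Z(..), (:.)(..), constant, lift, unlift, toList, unit )
import Data.Array.Accelerate.Interpreter ( run )  -- reference backend

-- Swap the components of a 2D index. The annotation on `unlift` tells
-- GHC which surface type to unpack the Exp-level shape into; without
-- it the Unlift instance is ambiguous and type checking fails.
swap2D :: Exp DIM2 -> Exp DIM2
swap2D ix =
  let Z :. i :. j = unlift ix :: Z :. Exp Int :. Exp Int
  in  lift (Z :. j :. i)
```

A function like this could serve, for instance, as the index permutation given 
to backpermute when transposing a matrix.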


> I've searched around a bit and found this [1] example implementation, but
> it is just as slow as a simple sequential algorithm in C.

What kind of GPU are you running on?
Also, I haven't benchmarked that matrix multiply code, but I am not overly 
surprised --- it wasn't designed to be efficient.


> Here's a snippet of what I have tried to make. There are several
> errors in there. Maybe I'm approaching the problem from the wrong
> angle.
> 
> matMul' arr brr =
>  let dotProd shp =
>        let (Z :. rowsA :. _)     = unlift (shape arr)    :: (Z :. Exp Int :. Exp Int)
>            (Z :. _     :. colsB) = unlift (shape brr)    :: (Z :. Exp Int :. Exp Int)
>            (Z :. i :. j) = unlift shp :: (Z :. Exp Int :. Exp Int)
>            rs = lift (Z :. All :.) (unlift i)
>            cs = (lift (Z :.) (unlift j)) (:. All)
>        in the $ A.fold1All (+) $ A.zipWith (+) (flatten (slice arr rs)) (flatten (slice brr cs))
>  in A.generate (lift (Z :. rowsA :. colsB)) dotProd

Yeah, this introduces nested parallelism, which Accelerate does not support: 
each thread (scalar computation) in the generate function is trying to initiate 
a new parallel computation via dotProd. Anyway, even if it did work, you'd 
still perform the same number of memory transfers as the replicate-based 
version you linked to earlier [1]. There is a bit more of an explanation of how 
reduction works in Accelerate at [2], slide 44.
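
For reference, that replicate-based formulation boils down to something like 
the following sketch (monomorphic in Float to keep the types simple; untuned):

```haskell
{-# LANGUAGE TypeOperators #-}

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate
  ( Acc, Array, All(..), DIM2, Exp, Z(..), (:.)(..) )
import Data.Array.Accelerate.Interpreter ( run )  -- reference backend

-- Replicate-based matrix multiply in the style of [1]: expand A along a
-- new middle axis and the transpose of B along a new outer axis, so both
-- operands have shape (Z :. rowsA :. colsB :. k), then multiply pointwise
-- and reduce the innermost axis with a single flat parallel fold.
-- No nested parallelism, but each input element is read many times.
matMul :: Acc (Array DIM2 Float) -> Acc (Array DIM2 Float) -> Acc (Array DIM2 Float)
matMul arr brr = A.fold (+) 0 (A.zipWith (*) arrRepl brrRepl)
  where
    Z :. rowsA :. _     = A.unlift (A.shape arr) :: Z :. Exp Int :. Exp Int
    Z :. _     :. colsB = A.unlift (A.shape brr) :: Z :. Exp Int :. Exp Int
    arrRepl = A.replicate (A.lift (Z :. All   :. colsB :. All)) arr
    brrRepl = A.replicate (A.lift (Z :. rowsA :. All   :. All)) (A.transpose brr)
```

The generate-based attempt above wanted exactly this result; the difference is 
that here the reduction over the inner dimension is expressed once, over the 
whole expanded array, rather than per output element.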

I am thinking about how to implement some shared memory optimisations, but that 
isn't likely to be ready soon.


If you want truly fast matrix multiply, Accelerate has an escape hatch --- 
there is a foreign function interface, so you can just call into the matrix 
multiply routine provided by, for example, the CUBLAS library. 


Cheers,
-Trev


[1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg102746.html
[2] 
http://www.cse.unsw.edu.au/~tmcdonell/presentations/2013-lambdajam-workshop.pdf



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
