Alexgian wrote:
> Haven't had the chance to fully check it out yet, but it seems they
> use AI techniques to compile the array operations into something that
> makes sense to the GPU cores (OpenCL)! The actual coding looks like
> Scheme. I think that it's an honourable effort.
There are a few things like that out there. Theano is the Python version
of the idea: write some C/GPU primitives, compile them, and relink them
with the interpreter. Torch7 is a little closer to being a native GPU
interpreter. If you stick to BLAS primitives, all you have to do is
declare your variables with :cuda(), and your code will run natively on
the GPU. For a simple mmul, comparing openBLAS/sgemm to CUDA/sgemm, you
can get some substantial speedups.
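Roughly, the two calls being compared look something like this in C (a
sketch, not my actual benchmark; it assumes openBLAS's cblas interface
and the cuBLAS v2 API, linked with -lopenblas -lcublas -lcudart):

    #include <stdlib.h>
    #include <cblas.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 3000;
        size_t bytes = (size_t)n * n * sizeof(float);
        float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
        for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; }

        /* CPU: openBLAS single-precision matrix multiply */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

        /* GPU: the same multiply through cuBLAS */
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        float alpha = 1.0f, beta = 0.0f;
        /* cuBLAS assumes column-major storage, so with row-major inputs
           this actually computes B*A; for a same-size timing comparison
           that detail doesn't matter */
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
        /* pulling the result back off the card is the slow part */
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

        cublasDestroy(h);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }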
Just to give an idea of the numbers involved: on my machine, openBLAS
dgemm (a super-optimized, threaded library) takes about 2.2 seconds to
multiply a couple of 3000^2 matrices. sgemm (single precision) takes
half the time; it's all memory bound. CUDA/sgemm takes 0.8 milliseconds.
Of course, it takes 25 milliseconds to get the result out of the GPU,
but that's still pretty good. FWIW, J does respectably well with
double-precision matrix mult, considering it is interpreted code running
single-threaded: about 85 seconds. That actually beats a lot of naive
compiled code (a pal's native D version of mmul took 300 seconds), which
is a tribute to J's power. The trick to all this kind of code is cache
awareness.
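The cache problem in a nutshell: a naive triple loop strides one operand
down its columns and can miss cache on nearly every inner step, while a
blocked version keeps small tiles of all three matrices resident.
Something along these lines, in C (a sketch; the tile size is an untuned
guess):

    #define N    3000
    #define TILE 64     /* tile size picked to sit in cache; untuned guess */

    /* Naive triple loop: the b[k*N + j] access strides across rows of b,
       so the inner loop thrashes the cache. */
    void mmul_naive(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i*N + k] * b[k*N + j];
                c[i*N + j] = s;
            }
    }

    /* Blocked version: work on TILE x TILE sub-blocks so the pieces of
       a, b, and c being touched stay resident in cache. */
    void mmul_blocked(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N*N; i++) c[i] = 0.0;
        for (int ii = 0; ii < N; ii += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int jj = 0; jj < N; jj += TILE)
                    for (int i = ii; i < ii+TILE && i < N; i++)
                        for (int k = kk; k < kk+TILE && k < N; k++) {
                            double aik = a[i*N + k];
                            for (int j = jj; j < jj+TILE && j < N; j++)
                                c[i*N + j] += aik * b[k*N + j];
                        }
    }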
While it is unpleasant to write code in terms of BLAS calls, J could
match all this with some FFI out to openBLAS and CUDA. It would
certainly be a huge boost to my personal productivity when preprocessing
data or writing filters.
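The FFI side needn't be elaborate: J's 15!:0 (cd) foreign can call into
any shared library, so a thin C wrapper with a flat, row-major signature
would keep the J side down to a single call. A hypothetical shim (the
name and build line are just illustrative):

    /* Tiny wrapper around cblas_dgemm with a flat, row-major signature,
       so the host interpreter's FFI only has to pass three data pointers
       and one dimension. Build as a shared library, e.g.
           gcc -O2 -fPIC -shared jmmul.c -o libjmmul.so -lopenblas      */
    #include <cblas.h>

    void j_mmul(const double *a, const double *b, double *c, int n)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

The CUDA path could get the same treatment with a wrapper around
cublasDgemm, device copies included.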
Anyway, just woolgathering here. One of those "maybe I will get to it
one of these days" things.
-SL