Alexgian wrote:
>Haven't had the chance to fully check it out yet, but it seems they use AI techniques to compile
>the array operations into something that makes sense to the GPU cores
>(OpenCL)! The actual coding looks like Scheme. I think that it's an honourable effort.

There are a few things like that out there. Theano is the Python version of the idea: build some C/GPU primitives, compile them, and relink them with the interpreter. Torch7 is a little closer to being a native GPU interpreter. If you stick to BLAS primitives, all you have to do is declare your variables with :cuda(), and your code will run natively on the GPU. For a simple mmul, comparing openBLAS/sgemm to CUDA/sgemm, you can get some substantial speedups.

Just to give an idea of the numbers involved: on my machine, openBLAS dgemm (a heavily optimized, threaded library) takes about 2.2 seconds to multiply a couple of 3000^2 matrices. sgemm (single precision) takes half the time; it's all memory bound. CUDA/sgemm takes 0.8 milliseconds. Of course, it takes 25 milliseconds to get the result out of the GPU, but that's still pretty good. FWIW, J does respectably well with double-precision matrix multiply, considering it is interpreted code running single-threaded: about 85 seconds. That actually beats a lot of naive compiled code (a pal's native D version of mmul took 300 seconds), which is a tribute to J's power. The trick to all this kind of code is cache awareness.
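To make the comparison concrete, here is a rough sketch of the kind of benchmark I mean, in plain CUDA C against cuBLAS (assuming the CUDA toolkit and cuBLAS are installed; the matrix size is the same 3000^2 case, and the timing split is just illustrative). It times the sgemm call and the device-to-host copy separately, since the copy is what dominates a single multiply:

/* gemm_bench.cu -- compile with something like: nvcc -O2 gemm_bench.cu -lcublas */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 3000;                               /* 3000^2 matrices, as above   */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (long i = 0; i < (long)n * n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    /* C = A * B on the device (column-major, no transposes) */
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(t1);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  /* the expensive part         */
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float gemm_ms, copy_ms;
    cudaEventElapsedTime(&gemm_ms, t0, t1);
    cudaEventElapsedTime(&copy_ms, t1, t2);
    printf("sgemm: %.2f ms, copy back: %.2f ms\n", gemm_ms, copy_ms);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}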

While it is unpleasant to write code in terms of BLAS calls, J could match all of this with some FFI out to openBLAS and CUDA; a small C shim like the one sketched below would do. It would certainly be a huge boost to my personal productivity when preprocessing data or writing filters.
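For what it's worth, the FFI side wouldn't have to be elaborate. Something like the following flat C entry point (the wrapper name and build line are my own invention; only cblas_sgemm itself is the real openBLAS/CBLAS call) is the sort of thing J's DLL interface could call out to, and a CUDA version would look the same with the device copies hidden inside the wrapper:

/* jgemm.c -- build with something like: gcc -O2 -shared -fPIC -o libjgemm.so jgemm.c -lopenblas */
#include <cblas.h>

/* C = A * B for n x n single-precision, row-major matrices */
void j_sgemm(int n, const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A, n,
                B, n,
                0.0f, C, n);
}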

Anyway, just woolgathering here. One of those "maybe I will get to it one of these days" things.

-SL
