Alexgian wrote:
> Haven't had the chance to fully check it out yet, but it seems they
> use AI techniques to compile the array operations into something that
> makes sense to the GPU cores (OpenCL)! The actual coding looks like
> Scheme. I think that it's an honourable effort.
There are a few things like that out there. Theano is the Python version
of the idea: write some C/GPU primitives, compile them, and relink them
with the interpreter. Torch7 is a little closer to being a native GPU
interpreter. If you stick to BLAS primitives, all you have to do is
declare your variables with :cuda(), and your code will run natively on
the GPU. For a simple mmul, comparing openBLAS/sgemm to CUDA/sgemm, you
can get some substantial speedups.
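Roughly, the two calls being compared look something like this in C (a
sketch, not my actual benchmark; it assumes openBLAS's cblas interface
and the cuBLAS v2 API, linked with -lopenblas -lcublas -lcudart):

    #include <stdlib.h>
    #include <cblas.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 3000;
        size_t bytes = (size_t)n * n * sizeof(float);
        float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
        for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; }

        /* CPU: openBLAS single-precision matrix multiply */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

        /* GPU: the same multiply through cuBLAS */
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        float alpha = 1.0f, beta = 0.0f;
        /* cuBLAS assumes column-major storage, so with row-major inputs
           this actually computes B*A; for a same-size timing comparison
           that detail doesn't matter */
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
        /* pulling the result back off the card is the slow part */
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

        cublasDestroy(h);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }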
Just to give an idea of the numbers involved: on my machine, openBLAS
dgemm (a super-optimized, threaded library) takes about 2.2 seconds to
multiply a couple of 3000^2 matrices. sgemm (single precision) takes
half the time; it's all memory bound. CUDA/sgemm takes 0.8 milliseconds.
Of course, it takes 25 milliseconds to get the result out of the GPU,
but that's still pretty good. FWIW, J does respectably well with
double-precision matrix mult, considering it is interpreted code running
single-threaded: about 85 seconds. That actually beats a lot of naive
compiled code (a pal's native D version of mmul took 300 seconds), which
is a tribute to J's power. The trick to all this kind of code is cache
awareness.
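The cache problem in a nutshell: a naive triple loop strides one operand
down its columns and can miss cache on nearly every inner step, while a
blocked version keeps small tiles of all three matrices resident.
Something along these lines, in C (a sketch; the tile size is an untuned
guess):

    #define N    3000
    #define TILE 64     /* tile size picked to sit in cache; untuned guess */

    /* Naive triple loop: the b[k*N + j] access strides across rows of b,
       so the inner loop thrashes the cache. */
    void mmul_naive(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i*N + k] * b[k*N + j];
                c[i*N + j] = s;
            }
    }

    /* Blocked version: work on TILE x TILE sub-blocks so the pieces of
       a, b, and c being touched stay resident in cache. */
    void mmul_blocked(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N*N; i++) c[i] = 0.0;
        for (int ii = 0; ii < N; ii += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int jj = 0; jj < N; jj += TILE)
                    for (int i = ii; i < ii+TILE && i < N; i++)
                        for (int k = kk; k < kk+TILE && k < N; k++) {
                            double aik = a[i*N + k];
                            for (int j = jj; j < jj+TILE && j < N; j++)
                                c[i*N + j] += aik * b[k*N + j];
                        }
    }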
While it is unpleasant to write code in terms of BLAS calls, J could
match all this with some FFI out to openBLAS and CUDA. It would
certainly be a huge boost to my personal productivity when preprocessing
data or writing filters.
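The FFI side needn't be elaborate: J's 15!:0 (cd) foreign can call into
any shared library, so a thin C wrapper with a flat, row-major signature
would keep the J side down to a single call. A hypothetical shim (the
name and build line are just illustrative):

    /* Tiny wrapper around cblas_dgemm with a flat, row-major signature,
       so the host interpreter's FFI only has to pass three data pointers
       and one dimension. Build as a shared library, e.g.
           gcc -O2 -fPIC -shared jmmul.c -o libjmmul.so -lopenblas      */
    #include <cblas.h>

    void j_mmul(const double *a, const double *b, double *c, int n)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

The CUDA path could get the same treatment with a wrapper around
cublasDgemm, device copies included.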
Anyway, just woolgathering here. One of those "maybe I will get to it
one of these days" things.
-SL