CUDA 6 is mainly about removing the need for programmers to use explicit DMA copies to move data structures between CPU and GPU memory. For applications that just copy inputs to the GPU, perform some computationally dense operation, and then read a few results back, performance should be similar to using explicit DMA copies.
You can typically get around 12 GB/s for DMA copies between the CPU and GPU for transfer sizes larger than 32KB. This is potentially less bandwidth than the CPU can get out of DRAM (usually something like 26 GB/s or 52 GB/s), and the access size is substantially bigger.

I don't know enough about the CLA to say how difficult a GPU port would be, but I do have some experience with porting deep neural networks to GPUs, and maybe someone with more experience with the CLA could comment on similarities and differences.

Deep neural networks are actually extremely easy to port to GPUs. You model them as an array of layers (each represented by a dense or block-sparse matrix). Each entry in one of these matrices is a floating point value that represents the weight for a neuron connection. You load these matrices directly into GPU memory when the program starts, and doing this is just about as fast as loading them into CPU memory (since you are probably loading from disk, and the CPU->GPU link has about 10-50x higher bandwidth than a disk). You then load a batch of samples into GPU memory and propagate them back and forth through the network.

The bulk of the work to propagate a set of samples through one layer of the network is a dense matrix-matrix multiply for a fully-connected layer, or a batched (parallel) set of smaller matrix-matrix multiplies for a locally-connected/convolutional layer. This operation is both extremely compute-intensive and extremely parallel. It is straightforward to achieve a high percentage of peak throughput, and there is no need to write any GPU code: you can just call into a matrix library, or into an image processing library when the tiles get small enough (and start using shared weights) that they look more like convolutions than general matrix multiplies. GPU memories are in the range of 1-12GB, so you can model a network with something like a billion connections on a single GPU.
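(Not part of the original thread, just a toy sketch of the model described above: each layer is a dense weight matrix, pushing a batch of samples through a fully-connected layer is one matrix-matrix multiply, and oversized sample sets are streamed through in mini-batches. All names and sizes here are made up for illustration; this uses NumPy on the CPU, whereas a real GPU version would hand the same shapes to a matrix library such as cuBLAS.)

```python
import numpy as np

def forward(layers, batch):
    """Propagate a batch of samples through a stack of fully-connected layers.

    layers -- list of dense weight matrices; layers[i] has shape (n_in, n_out)
    batch  -- sample matrix of shape (n_samples, n_in of the first layer)

    Each layer is one dense matrix-matrix multiply (GEMM) plus an elementwise
    nonlinearity -- exactly the operation a GPU matrix library excels at.
    """
    activations = batch
    for W in layers:
        activations = np.tanh(activations @ W)  # GEMM + nonlinearity
    return activations

# Toy network: three layers, loaded once "into device memory" at startup.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)),
          rng.standard_normal((32, 16)),
          rng.standard_normal((16, 4))]

# Mini-batch loop sketch: when the sample set is larger than device memory,
# stream it through in batches and iterate stochastically.
samples = rng.standard_normal((1000, 64))
batch_size = 256
for start in range(0, len(samples), batch_size):
    out = forward(layers, samples[start:start + batch_size])
    # ... compute loss / update weights here in a real trainer ...

# out now holds the final (partial) batch: 1000 - 768 = 232 samples, 4 outputs.
print(out.shape)  # -> (232, 4)
```

The point of the sketch is that the inner loop is nothing but dense linear algebra, so none of the per-batch work requires hand-written GPU kernels.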
If you have more samples than will fit in GPU memory, then you batch them and train stochastically over multiple iterations. Again, you are typically bottlenecked either by the network evaluation (for a modest to large network) or by the speed at which you can read samples off of disk (for a small network). DMA throughput and latency never come into the picture, because all of the work is performed by the GPU and all of the data stays on the GPU until you are ready to assign a prediction to each sample or save the updated model back to disk.

A straightforward way to port the CLA would be to adopt a similar model: load the entire model into GPU memory, then run samples through it. It would be especially useful to be able to cast the bulk of the work performed by the CLA in terms of linear algebra or other well-known operations like convolution, because it is actually quite difficult (i.e. time consuming) to write high-performance low-level code. Note that this isn't just true for GPUs; most people can't write a dense matrix library that competes with MKL on Intel CPUs either.

Greg

On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:
> Hideaki your PDF is a good starting point for implementing a GPU
> accelerated spatial pooler.
>
> I have pointed to this before - some work has been done assessing the CLA
> (based on CLA whitepaper 1.0) and opportunities to parallelize it with a
> GPU. http://pdxscholar.library.pdx.edu/open_access_etds/202/
>
> RE: CUDA 6, latency is still an issue. CUDA 6 makes memory management
> simpler, but memory copies between system and GPU memory still need to
> be done, and the latencies remain. On the upside, CUDA 6 handles the
> programming of data block transfers transparently for developers. In
> contrast, AMD's HSA architecture implements truly unified memory access
> and reduces latency. I am sure Nvidia will be close behind, and the next
> iteration will have a similar unified architecture that reduces latency.
> Moving memory from CPU to GPU and back kills a lot of the performance
> gains when it comes to the CLA.
>
> Hideaki, you are correct, the synapse traversals in the CLA make it a
> poor fit for current GPU architectures. You have to constantly reach out
> to other memory blocks to get the data you need to perform updates. The
> spatial pooler is a better fit for the current generation of GPUs, and we
> could start with spatial pooler GPU acceleration and gain some benefits
> right now.
>
> On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki <[email protected]> wrote:
>
>> Hi Sergey,
>>
>> Thank you for the link. This is really a cool idea. Recent high-end
>> GPUs have a few GB of memory on the device, even gaming cards like the
>> GTX 780 Ti. It looks like main memory would act as an L3 or L4 cache for
>> the memory on the GPU device, and the CPU would be like a co-processor
>> for some kinds of computational tasks. ;-)
>>
>> Unfortunately, I'm not aware of any progress on running NuPIC on GPUs.
>> I guess the difficulty comes from 1) the parallel link (=synapse)
>> traversals in the CLA and 2) the lack of a solid C code base that can be
>> ported to a GPU kernel.
>>
>> Once nupic.core is ready, it may be a good time to try this porting
>> out. In addition, CUDA 6 was released recently, which will make
>> programmers' lives easier. So the situation is better than ever, I
>> believe... [*]
>>
>> A few months ago, I drew a sketch on a piece of paper about how to run
>> the SP on a GPU. I picked it up and transferred my rough handwriting to
>> a PDF file, in the hope that you may find it interesting... PFA.
>> (note: my sketch is not a porting design but a GPU-oriented design
>> unrelated to NuPIC)
>>
>> [*] https://developer.nvidia.com/cuda-toolkit
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
