Hi Greg,

About a year ago I looked at the CLA algorithm with the aim of porting it to GPUs.
In principle the algorithm should be an "easy" port (running entirely on the GPU), but at the time there were many software engineering issues to overcome; more below.

I would classify the CLA algorithm as closer to a molecular dynamics simulation than to anything like a sparse matrix product. Maybe it could be mathematically reformulated as such, but I don't know about that. It is basically an algorithm where, for each column (the equivalent of a cortical micro-column) and for each synapse in the column, you check within a certain cut-off radius for a binary match with the observed input (or with synapses from other columns). A permanence value for the synapse is incremented or decremented based on whether or not it overlaps with the input. Things are much more complicated than that (there are many more phases); for a better explanation of the algorithm (with pseudo-code) you can refer to: http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf

The software engineering problem I had back when I looked into this was that the code was mostly Python, and the early C++ implementation had many encapsulated classes and layers: region -> columns -> cells -> segments -> synapses -> etc. Each class had its own methods, attributes and properties. This made it a nightmare, for instance, to copy all the permanences of all the synapses into one array that all the GPU threads could work on. A natural parallelization would have been to assign different regions to different kernels and to parallelize the permanence checks and updates for spatial and temporal pooling within a kernel. Depending on the phase of the algorithm, a GPU thread could be working on a column or on a cell.
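To make the per-synapse update concrete, here is a minimal sketch of the permanence learning step laid out as flat "structure of arrays" data, the layout I am arguing for. All names and the learning constants are illustrative, not NuPIC's actual API:

```python
# Hedged sketch: a structure-of-arrays layout for the permanence update.
# PERM_INC / PERM_DEC and all function names are illustrative assumptions.

PERM_INC = 0.05   # increment when a synapse overlaps an active input bit
PERM_DEC = 0.02   # decrement when it does not

def update_permanences(permanences, syn_input_idx, col_offsets,
                       active_columns, input_bits):
    """Update synapse permanences in place, one flat pass per column.

    permanences    -- flat list of floats, one per synapse (all columns)
    syn_input_idx  -- flat list: the input bit each synapse connects to
    col_offsets    -- col_offsets[c]..col_offsets[c+1] span column c's synapses
    active_columns -- columns that won inhibition this step
    input_bits     -- binary input vector (list of 0/1)
    """
    for c in active_columns:
        for s in range(col_offsets[c], col_offsets[c + 1]):
            if input_bits[syn_input_idx[s]]:
                permanences[s] = min(1.0, permanences[s] + PERM_INC)
            else:
                permanences[s] = max(0.0, permanences[s] - PERM_DEC)
```

Because each iteration of the inner loop touches only its own synapse, every (column, synapse) pair could map to one GPU thread over the same flat arrays, which is exactly what the nested class hierarchy made hard.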
I don't know what the current status of the C++ effort is, but my feeling from the early stages is that the engine needed to be more lightweight, and reorganized to use structures of arrays (for instance, one array holding all the permanence values, etc.).

Oreste

On Wed, May 7, 2014 at 7:35 PM, Ian Danforth <[email protected]> wrote:

> Greg,
>
> Thanks for this detailed answer! If you ever feel interested in porting the
> CLA I'd love to work with you. A typical model today might maintain ~409600
> permanences for the spatial pooler (2048 columns * 400-bit input * .5
> potential pool), and the temporal pooler has potentially 65536^2 weights,
> but starts with a tiny tiny fraction of that and grows over time. That
> growth process is probably the trickiest to manage. That said, the way new
> connections are chosen could probably be done in a more GPU-friendly way if
> we understood the most efficient way to do so.
>
> Ian
>
> On Wed, May 7, 2014 at 5:22 PM, Greg Diamos <[email protected]> wrote:
>
>> CUDA 6 is mainly about removing the need for programmers to use explicit
>> DMA copies to move data structures between CPU and GPU memory. For
>> applications that just need to copy inputs to the GPU, perform some
>> computationally dense operation, then read a few results back, performance
>> should be similar to DMA operations.
>>
>> You can typically get around 12GB/s for DMA copies between the CPU and
>> GPU for DMA sizes larger than 32KB. This is potentially less bandwidth
>> than the CPU can get out of DRAM (usually something like 26GB/s or
>> 52GB/s), and the access size is substantially bigger.
>>
>> I don't know enough about the CLA to say how difficult a GPU port would
>> be, but I do have some experience with porting deep neural networks to
>> GPUs, and maybe someone with more experience with the CLA could comment
>> on similarities and differences.
>>
>> Deep neural networks are actually extremely easy to port to GPUs.
>> You model them as an array of layers (each represented by a dense or
>> block-sparse matrix). Each entry in one of these matrices is a
>> floating-point value that represents the weight of a neuron connection.
>> You load these matrices directly into GPU memory when the program starts,
>> and doing this is just about as fast as loading them into CPU memory
>> (since you are probably loading from disk, and the CPU->GPU link has about
>> 10-50x higher bandwidth than a disk). You then load a batch of samples
>> into GPU memory and propagate them back and forth through the network.
>>
>> The bulk of the work to propagate a set of samples through one layer in
>> the network is performed using a dense matrix-matrix multiply operation
>> for a fully-connected layer, or a batched (parallel) set of smaller
>> matrix-matrix multiplies for a locally-connected/convolutional layer.
>> This operation is both extremely compute-intensive and extremely
>> parallel. It is straightforward to achieve high percentages of peak
>> throughput, and there is no need to write any GPU code. You can just call
>> into a matrix library, or an image processing library when the tiles get
>> small enough (and start using shared weights) that they start looking
>> more like convolutions than general matrix multiplies.
>>
>> GPU memories are in the range of 1-12GB, so you can model a network with
>> something like a billion connections on a single GPU. If you have more
>> samples than will fit into GPU memory, then you batch them and train
>> stochastically over multiple iterations. Again, you are typically
>> bottlenecked either by the network evaluation (for a modest to large
>> network) or by the speed at which you can read samples off of disk (for a
>> small network).
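The layer-wise evaluation described above can be sketched in a few lines. A real port would call a tuned library (cuBLAS, MKL) for the matrix multiply; this tiny pure-Python stand-in, with made-up names, just illustrates the data flow of a batch through fully-connected layers:

```python
# Hedged sketch of batched forward propagation: each layer is one dense
# matrix-matrix multiply followed by a nonlinearity. Pure-Python matmul
# is a stand-in for the library call a real port would use.

def matmul(a, b):
    """Multiply an (n x k) by a (k x m) list-of-lists matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def forward(batch, layers):
    """Propagate a batch (samples x features) through weight matrices."""
    activations = batch
    for weights in layers:
        activations = matmul(activations, weights)
        # a real network applies a nonlinearity here; ReLU as an example
        activations = [[max(0.0, x) for x in row] for row in activations]
    return activations
```

The point Greg makes holds here: all of the work is inside `matmul`, so swapping in a GPU matrix library parallelizes the whole evaluation without writing any kernel code.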
>> DMA throughput or latency never comes into the picture, because all of
>> the work is performed by the GPU and all of the data stays on the GPU
>> until you are ready to assign a prediction to each sample or save the
>> updated model back to disk.
>>
>> A straightforward way to port the CLA would be to adopt a similar model:
>> load the entire model into GPU memory, then run samples through it. It
>> would be especially useful to be able to cast the bulk of the work
>> performed by the CLA in terms of linear algebra or other well-known
>> operations like convolution, because it is actually quite difficult
>> (i.e. time-consuming) to write high-performance low-level code. Note that
>> this isn't just true for GPUs; most people can't write a dense matrix
>> library that competes with MKL on Intel CPUs either.
>>
>> Greg
>>
>> On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:
>>
>>> Hideaki, your PDF is a good starting point for implementing a
>>> GPU-accelerated spatial pooler.
>>>
>>> I have pointed to this before: some work has been done assessing the CLA
>>> (based on CLA whitepaper 1.0) and opportunities to parallelize it with a
>>> GPU. http://pdxscholar.library.pdx.edu/open_access_etds/202/
>>>
>>> RE: CUDA 6, latency is still an issue. CUDA 6 makes memory management
>>> simpler, but memory copies between system and GPU memory still need to
>>> be done, and the latencies remain. On the upside, CUDA 6 handles the
>>> programming of data block transfers transparently for developers. In
>>> contrast, AMD's HSA architecture implements truly unified memory access
>>> and reduces latency. I am sure Nvidia will be close behind, and the next
>>> iteration will have a similar unified architecture that reduces latency.
>>> Moving memory from CPU to GPU and back kills a lot of the performance
>>> gains when it comes to the CLA.
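A quick back-of-the-envelope check, using only figures already quoted in this thread (Ian's spatial pooler dimensions and Greg's ~12GB/s DMA number), shows what a full model copy costs. The 4-byte float size is an assumption for illustration, not a measurement:

```python
# Figures from the thread: 2048 columns * 400-bit input * .5 potential
# pool, and ~12 GB/s CPU<->GPU DMA throughput. 4-byte permanences assumed.

columns = 2048
input_bits = 400
potential_pool = 0.5
permanences = int(columns * input_bits * potential_pool)  # 409600

bytes_total = permanences * 4        # ~1.6 MB of spatial pooler state
dma_bandwidth = 12e9                 # ~12 GB/s DMA throughput
transfer_s = bytes_total / dma_bandwidth

print(permanences)   # 409600
print(bytes_total)   # 1638400 bytes
print(transfer_s)    # ~1.4e-4 s per full copy
```

A single copy is cheap, but repeating it every time step in both directions is exactly the overhead Doug describes, which is why Greg's suggestion of keeping the whole model resident in GPU memory matters.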
>>> Hideaki, you are correct: the synapse traversals in the CLA make it a
>>> poor fit for current GPU architectures. You have to constantly reach out
>>> to other memory blocks to get the data you need to perform updates. The
>>> spatial pooler is a better fit for the current generation of GPUs, and
>>> we could start with spatial pooler GPU acceleration and gain some
>>> benefits right now.
>>>
>>> On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki <[email protected]> wrote:
>>>
>>>> Hi Sergey,
>>>>
>>>> Thank you for the link. This is a really cool idea. Recent high-end
>>>> GPUs have a few GB of memory on the device, even gaming cards like the
>>>> GTX 780 Ti. It looks like main memory would act as an L3 or L4 cache
>>>> for the memory on the GPU device, with the CPU as a co-processor for
>>>> certain computational tasks. ;-)
>>>>
>>>> Unfortunately, I'm not aware of any progress on running NuPIC on the
>>>> GPU. I guess the difficulty comes from 1) the parallel link (= synapse)
>>>> traversals in the CLA, and 2) the need for a solid C code base that can
>>>> be ported to GPU kernels.
>>>>
>>>> Once nupic.core is ready, it may be a good time to try this porting
>>>> out. In addition, CUDA 6 was released recently, which will make
>>>> programmers' lives easier. So the situation is better than ever, I
>>>> believe... [*]
>>>>
>>>> A few months ago, I drew a sketch on a piece of paper about how to run
>>>> the SP on a GPU. I picked it up and transferred my messy handwriting to
>>>> a PDF file, in the hope that you may find it interesting... PFA.
>>>> (note: my sketch is not about a porting design but about a
>>>> GPU-oriented design unrelated to NuPIC)
>>>>
>>>> [*] https://developer.nvidia.com/cuda-toolkit

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
