What about OpenCL?
On Thu, May 8, 2014 at 2:46 PM, Oreste Villa <[email protected]> wrote:

> Hi Greg,
>
> About a year ago I looked at the CLA algorithm with the aim of porting it to GPUs.
>
> In principle the algorithm should be an "easy" port (running entirely on the GPU), but at that time there were many SW engineering issues to overcome... more below.
>
> I would classify the CLA algorithm as more like a molecular dynamics simulation than anything like a sparse matrix product. Maybe it could be mathematically reformulated as such, but I don't know about that.
>
> It is basically an algorithm where, for each column (the equivalent of a cortical micro-column) and for each synapse in the column, you check within a certain cut-off radius for a binary match with the observed input (or with synapses from other columns). A permanence value for the synapse is incremented or decremented based on whether or not it overlaps with the inputs. Things are much more complicated (as there are many more phases); for a better explanation of the algorithm (with pseudo-code) you can refer to:
>
> http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf
>
> The SW engineering problem I had back when I looked into this was that the code was mostly Python, and the early C++ implementation had many encapsulated classes and layers like region->columns->cell->segments->synapses->etc. Each class had its own methods, attributes and properties. This, for instance, made it a nightmare to copy all the permanences of all the synapses into an array that all the GPU threads could work on.
>
> A natural parallelization would have been to assign different regions to different kernels and to parallelize the permanence checks and updates for spatial and temporal pooling within a kernel. Depending on the phase of the algorithm, a GPU thread could be working on a column or a cell.
>
> I don't know what the current status of the C++ effort is, but my feeling from those early stages is that the engine needed to be more lightweight and reorganized around structures of arrays (for instance, an array of all the permanence values, etc.).
>
> Oreste
>
> On Wed, May 7, 2014 at 7:35 PM, Ian Danforth <[email protected]> wrote:
>
>> Greg,
>>
>> Thanks for this detailed answer! If you ever feel interested in porting the CLA I'd love to work with you. A typical model today might maintain ~409600 permanences for the spatial pooler (2048 columns * 400 bit input * .5 potential pool), and then the temporal pooler has potentially 65536^2 weights, but starts with a tiny, tiny fraction of that and grows over time. That growth process is probably the trickiest to manage. That said, the way new connections are chosen could probably be done in a more GPU-friendly way if we understood the most efficient way to do so.
>>
>> Ian
>>
>> On Wed, May 7, 2014 at 5:22 PM, Greg Diamos <[email protected]> wrote:
>>
>>> CUDA 6 is mainly about removing the need for programmers to use explicit DMA copies to move data structures between CPU and GPU memory. For applications that just need to copy inputs to the GPU, perform some computationally dense operation, then read a few results back, performance should be similar to DMA operations.
>>>
>>> You can typically get around 12 GB/s for DMA copies between the CPU and GPU for DMA sizes larger than 32KB. This is potentially less bandwidth than the CPU can get out of DRAM (usually something like 26 GB/s or 52 GB/s), and the access size is substantially bigger.
>>>
>>> I don't know enough about the CLA to say how difficult a GPU port would be, but I do have some experience with porting deep neural networks to GPUs, and maybe someone with more experience with the CLA could comment on similarities and differences.
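The structure-of-arrays layout Oreste suggests, at roughly the spatial-pooler scale Ian mentions, could be sketched as follows. This is a hypothetical NumPy illustration, not NuPIC's actual data layout; the learning constants and variable names are invented:

```python
import numpy as np

# Structure-of-arrays layout: one flat array per attribute, instead of
# nested region->column->cell->segment->synapse objects, so a GPU kernel
# (or vectorized CPU code) can sweep all synapses in one pass.
n_columns = 2048
input_bits = 400
pool_frac = 0.5
syn_per_col = int(input_bits * pool_frac)          # 200 potential synapses/column
n_syn = n_columns * syn_per_col                    # 409600 permanences total

rng = np.random.default_rng(0)
permanence = rng.random(n_syn).astype(np.float32)  # all permanences, one flat array
presyn_bit = rng.integers(0, input_bits, n_syn)    # input bit each synapse watches

# One learning step for every synapse at once: increment where the synapse's
# input bit is active, decrement where it is not (invented constants).
inc, dec = 0.05, 0.02
input_sdr = rng.random(input_bits) < 0.1           # sparse binary input
active = input_sdr[presyn_bit]
permanence = np.clip(permanence + np.where(active, inc, -dec), 0.0, 1.0)

print(permanence.shape)  # (409600,)
```

The point of the flat arrays is exactly the copy problem Oreste describes: one contiguous buffer per attribute can be handed to GPU threads directly, with no object-graph traversal.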
>>>
>>> Deep neural networks are actually extremely easy to port to GPUs. You model them as an array of layers (each represented by a dense or block-sparse matrix). Each entry in one of these matrices is a floating point value that represents the weight of a neuron connection. You load these matrices directly into GPU memory when the program starts, and doing this is just about as fast as loading them into CPU memory (since you are probably loading from disk, and the CPU->GPU link has about 10-50x higher bandwidth than a disk). You then load a batch of samples into GPU memory and propagate them back and forth through the network. The bulk of the work to propagate a set of samples through one layer of the network is performed with a dense matrix-matrix multiply for a fully-connected layer, or a batched (parallel) set of smaller matrix-matrix multiplies for a locally-connected/convolutional layer. This operation is both extremely compute-intensive and extremely parallel. It is straightforward to achieve high percentages of peak throughput, and there is no need to write any GPU code. You can just call into a matrix library, or an image processing library when the tiles get small enough (and start using shared weights) that they start looking more like convolutions than general matrix multiplies.
>>>
>>> GPU memories are in the range of 1-12GB, so you can model a network with something like a billion connections on a single GPU. If you have more samples than will fit into GPU memory, you batch them and train stochastically over multiple iterations. Again, you are typically bottlenecked either by the network evaluation (for a modest to large network) or by the speed at which you can read samples off of disk (for a small network).
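The batched propagation Greg describes, one dense matrix-matrix multiply per fully-connected layer, reduces to something like the sketch below. Layer sizes and batch size are invented; on a GPU the same multiply would be dispatched to cuBLAS or a similar matrix library rather than NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights for a small stack of fully-connected layers (invented sizes).
# On a GPU these would be loaded into device memory once, at startup.
layer_sizes = [784, 256, 64, 10]
weights = [rng.standard_normal((m, n)).astype(np.float32) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

# A whole batch of samples moves through each layer as a single dense
# matrix-matrix multiply -- the compute-intensive, highly parallel
# operation that a matrix library handles at near-peak throughput.
batch = rng.standard_normal((128, layer_sizes[0])).astype(np.float32)

x = batch
for w in weights:
    x = np.maximum(x @ w, 0.0)   # matrix multiply + ReLU nonlinearity

print(x.shape)  # (128, 10)
```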
>>>
>>> DMA throughput and latency never come into the picture, because all of the work is performed by the GPU, and all of the data stays on the GPU until you are ready to assign a prediction to each sample or save the updated model back to disk.
>>>
>>> A straightforward way to port the CLA would be to adopt a similar model: load the entire model into GPU memory, then run samples through it. It would be especially useful to be able to cast the bulk of the work performed by the CLA in terms of linear algebra or other well-known operations like convolution, because it is actually quite difficult (i.e. time consuming) to write high performance low level code. Note that this isn't just true for GPUs; most people can't write a dense matrix library that competes with MKL on Intel CPUs either.
>>>
>>> Greg
>>>
>>> On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:
>>>
>>>> Hideaki, your PDF is a good starting point for implementing a GPU accelerated spatial pooler.
>>>>
>>>> I have pointed to this before - some work has been done assessing the CLA (based on CLA whitepaper 1.0) and opportunities to parallelize it with a GPU: http://pdxscholar.library.pdx.edu/open_access_etds/202/
>>>>
>>>> RE: CUDA 6, latency is still an issue. CUDA 6 makes memory management simpler, but memory copies between system and GPU memory still need to be done, and the latencies remain. On the upside, CUDA 6 handles the programming of data block transfers transparently for developers. In contrast, AMD's HSA architecture implements truly unified memory access and reduces latency. I am sure Nvidia will be close behind, and the next iteration will have a similar unified architecture that reduces latency. Moving memory from CPU to GPU and back kills a lot of the performance gains when it comes to the CLA.
>>>>
>>>> Hideaki, you are correct: the synapse traversals in the CLA make it a poor fit for current GPU architectures. You have to constantly reach out to other memory blocks to get the data you need to perform updates. The spatial pooler is a better fit for the current generation of GPUs, and we could start with spatial pooler GPU acceleration and gain some benefits right now.
>>>>
>>>> On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki <[email protected]> wrote:
>>>>
>>>>> Hi Sergey,
>>>>>
>>>>> Thank you for the link. This is a really cool idea. Recent high-end GPUs have a few GB of memory on the device, even gaming cards like the GTX 780 Ti. It looks like main memory would act as an L3 or L4 cache for the memory on the GPU device, and the CPU would be like a co-processor for certain computational tasks. ;-)
>>>>>
>>>>> Unfortunately, I'm not aware of any progress on NuPIC on GPU. I guess the difficulty comes from 1) the parallel link (= synapse) traversals in the CLA and 2) the lack of a solid C code base that could be ported to GPU kernels.
>>>>>
>>>>> Once nupic.core becomes ready, it may be a good time to try this porting out. In addition, CUDA 6 has been released recently, which will make programmers' lives easier. So the situation is better than ever, I believe... [*]
>>>>>
>>>>> A few months ago, I drew a sketch on a piece of paper about how to run the SP on a GPU. I picked it up and transferred my messy handwriting to a PDF file, in the hope that you may find it interesting... PFA.
>>>>> (Note: my sketch is not about a porting design but about a GPU-oriented design unrelated to NuPIC.)
>>>>>
>>>>> [*] https://developer.nvidia.com/cuda-toolkit
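The spatial-pooler phase that Doug and Hideaki single out as GPU-friendly is dominated by an overlap computation, and that computation can itself be cast as a matrix-vector product — the kind of well-known linear-algebra operation Greg suggests targeting. A rough NumPy sketch, with invented sizes, sparsity, and inhibition threshold (not NuPIC's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_columns, input_bits = 2048, 400

# Connected-synapse matrix: row i marks which input bits column i is
# connected to (permanence above threshold). Dense here for clarity;
# a real implementation might use a sparse or bit-packed layout.
connected = (rng.random((n_columns, input_bits)) < 0.1).astype(np.float32)

input_sdr = (rng.random(input_bits) < 0.05).astype(np.float32)

# Overlap of every column with the input, all at once: one matrix-vector
# product replaces the per-column, per-synapse traversal.
overlap = connected @ input_sdr

# Inhibition sketch: keep the 40 columns with the highest overlap.
k = 40
winners = np.argsort(overlap)[-k:]
print(winners.shape)  # (40,)
```

On a GPU the `connected @ input_sdr` step maps directly onto a library GEMV call, which is why the spatial pooler looks like the natural first target for acceleration.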
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
