What about OpenCL?

On Thu, May 8, 2014 at 2:46 PM, Oreste Villa <[email protected]> wrote:

> Hi Greg,
>
> About a year ago I looked at the CLA algorithm with the aim of
> porting it to GPUs.
>
> In principle the algorithm should be an "easy" port (running entirely on
> the GPU), but at that time there were many SW engineering issues to
> overcome... more below.
>
> I would classify the CLA algorithm as more like a molecular dynamics
> simulation than anything like a sparse matrix product. Maybe it could be
> mathematically reformulated as such, but I don't know about that.
>
> It is basically an algorithm where, for each column (the equivalent of a
> cortical micro-column) and for each synapse in the column, you check
> within a certain cut-off radius for a binary match with the observed
> input (or with synapses from other columns). A permanence value for the
> synapse is incremented or decremented based on whether or not you overlap
> with the inputs. Things are much more complicated in practice (there are
> many more phases); for a better explanation of the algorithm (with
> pseudo-code) you can refer to:
>
> http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf
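The increment/decrement step Oreste describes can be sketched in a few lines of NumPy. This is a hedged illustration only: the array names, sizes, and the `inc`/`dec` constants are assumptions for the sketch, not NuPIC's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

n_columns, n_synapses = 8, 16  # toy sizes for illustration
permanences = rng.random((n_columns, n_synapses))           # one permanence per synapse
inputs = rng.integers(0, 2, size=(n_columns, n_synapses))   # binary input bit each synapse sees
inc, dec = 0.05, 0.03                                       # assumed learning constants

# Synapses overlapping an active input bit are reinforced; the rest decay.
permanences = np.where(inputs == 1, permanences + inc, permanences - dec)
permanences = np.clip(permanences, 0.0, 1.0)                # keep permanences in [0, 1]
```

Because every synapse is updated independently, this inner step is embarrassingly parallel, which is what makes the GPU port plausible in principle.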
>
> The SW engineering problem I had back when I looked into this was that
> the code was mostly Python, and the early C++ implementation had many
> encapsulated classes and layers, like
> region->columns->cell->segments->synapses->etc.
> Each class had its own methods, attributes and properties. This, for
> instance, made it a nightmare to copy all the permanences of all the
> synapses into one array that all the GPU threads could work on.
>
> A natural parallelization would have been to assign different regions to
> different kernels and to parallelize the permanence checks and updates
> for spatial and temporal pooling within a kernel. Depending on the phase
> of the algorithm, a GPU thread could be working on a column or on a cell.
>
> I don't know the current status of the C++ effort, but my feeling from
> the early stages is that the engine needed to be more lightweight and
> reorganized around structures of arrays (for instance, one array of all
> the permanence values, etc.).
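The structure-of-arrays layout Oreste suggests might look like the sketch below. Field names and the CSR-style indexing scheme are illustrative assumptions, not NuPIC's actual data model.

```python
import numpy as np

# Instead of region -> column -> cell -> segment -> synapse objects, keep
# one flat array per attribute, so the whole model can be shipped to the
# GPU in a single contiguous transfer.
n_segments = 10_000
n_synapses = 1_000_000

permanence  = np.zeros(n_synapses, dtype=np.float32)  # all permanences, contiguous
presyn_cell = np.zeros(n_synapses, dtype=np.int32)    # presynaptic cell per synapse

# CSR-style offsets: the synapses of segment s live in
# permanence[seg_ptr[s]:seg_ptr[s + 1]]
seg_ptr = np.linspace(0, n_synapses, n_segments + 1).astype(np.int64)

s = 42
seg_perms = permanence[seg_ptr[s]:seg_ptr[s + 1]]     # one segment's synapses, a zero-copy view
```

The design choice here is that per-segment access becomes slicing into flat arrays rather than pointer chasing through nested objects, which is exactly the layout a GPU kernel (or a single memcpy to device memory) wants.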
>
>
> Oreste
>
>
> On Wed, May 7, 2014 at 7:35 PM, Ian Danforth <[email protected]> wrote:
>
>> Greg,
>>
>> Thanks for this detailed answer! If you ever feel interested in porting
>> the CLA I'd love to work with you. A typical model today might maintain
>> ~409,600 permanences for the spatial pooler (2048 columns * 400-bit input
>> * 0.5 potential pool), and the temporal pooler has potentially 65536^2
>> weights, but it starts with a tiny fraction of that and grows over time.
>> That growth process is probably the trickiest to manage. That said, the
>> way new connections are chosen could probably be done in a more
>> GPU-friendly way if we understood the most efficient way to do so.
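The sizes Ian quotes work out as follows. This is quick back-of-envelope arithmetic; the float32 storage assumption is the editor's, not Ian's.

```python
# Spatial pooler: 2048 columns, each sampling half of a 400-bit input.
columns, input_bits, potential_frac = 2048, 400, 0.5
sp_permanences = int(columns * input_bits * potential_frac)   # 409,600 permanences

# Temporal pooler upper bound: 65536 cells, all-to-all.
cells = 65536
tp_max_weights = cells ** 2                                   # ~4.3 billion potential weights

# As float32, the spatial pooler fits trivially in GPU memory; the
# temporal pooler must stay sparse and grow over time, as Ian notes.
sp_bytes = sp_permanences * 4                                 # ~1.6 MB
```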
>>
>> Ian
>>
>>
>>
>>
>> On Wed, May 7, 2014 at 5:22 PM, Greg Diamos <[email protected]> wrote:
>>
>>> CUDA 6 is mainly about removing the need for programmers to use explicit
>>> DMA copies
>>> to move data structures between CPU and GPU memory.  For applications
>>> that just need
>>> to copy inputs to the GPU, perform some computationally dense operation,
>>> then read
>>> a few results back, performance should be similar to DMA operations.
>>>
>>> You can typically get around 12 GB/s for DMA copies between the CPU
>>> and GPU for DMA sizes larger than 32 KB. This is potentially less
>>> bandwidth than the CPU can get out of DRAM (usually something like
>>> 26 GB/s or 52 GB/s), and the minimum efficient access size is
>>> substantially bigger.
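For scale, the bandwidth figure Greg cites implies transfer times like the following. The model size is borrowed from Ian's spatial-pooler example in this thread; the arithmetic is the editor's.

```python
# Time to DMA a 409,600-entry float32 permanence array at 12 GB/s.
model_bytes = 409600 * 4                    # ~1.6 MB
dma_bw = 12e9                               # bytes/s, per the figure above
dma_seconds = model_bytes / dma_bw          # ~0.14 ms: negligible if done once at
                                            # startup, costly if repeated every timestep
```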
>>>
>>> I don't know enough about CLA to say how difficult a GPU port would be,
>>> but I do have some
>>> experience with porting deep neural networks to GPUs, and maybe someone
>>> with more
>>> experience with the CLA could comment on similarities and differences.
>>>
>>> Deep neural networks are actually extremely easy to port to GPUs.  You
>>> model them
>>> as an array of layers (each represented by a dense or block-sparse
>>> matrix).  Each entry
>>> in one of these matrices is a floating point value that represents the
>>> weight for a neuron
>>> connection.  You load these matrices directly in GPU memory when the
>>> program starts,
>>> and doing this is just about as fast as loading them into CPU memory
>>> (since you are probably
>>> loading it from disk and the CPU->GPU link is about 10-50x higher
>>> bandwidth than a disk).
>>> You then load a batch of samples into GPU memory, and propagate them
>>> back and forth
>>> through the network.  The bulk of the work to propagate a set of samples
>>> through one layer
>>> in the network is performed using a dense matrix-matrix multiply
>>> operation for a fully-connected
>>> layer, or a batched (parallel) set of smaller matrix-matrix multiplies
>>> for a
>>> locally-connected/convolutional layer.  This operation is both extremely
>>> compute-intensive,
>>> and extremely parallel.  It is straightforward to achieve high
>>> percentages of peak throughput,
>>> and there is no need to write any GPU code.  You can just call into a
>>> matrix library, or an image
>>> processing library when the tiles get small enough (and start using
>>> shared weights) that they
>>> start looking more like convolutions than general matrix multiplies.
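The fully-connected forward step Greg describes reduces to one dense matrix-matrix multiply (GEMM) per layer. A minimal NumPy sketch follows; the sizes, the bias term, and the ReLU nonlinearity are illustrative choices, not a claim about any particular network.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_in, n_out = 64, 1024, 512          # illustrative sizes

W = rng.standard_normal((n_in, n_out)).astype(np.float32)  # layer weights
b = np.zeros(n_out, dtype=np.float32)                      # layer bias
X = rng.standard_normal((batch, n_in)).astype(np.float32)  # a batch of samples

# One GEMM plus an elementwise nonlinearity; on a GPU this is a single
# call into a matrix library (e.g. a cuBLAS-style GEMM), with no custom
# kernel code needed, which is Greg's point.
H = np.maximum(X @ W + b, 0.0)
```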
>>>
>>> GPU memories are in the range of 1-12GB, so you can model a network with
>>> something like a billion connections on a single GPU.  If you have more
>>> samples than will
>>> fit into GPU memory, then you batch them and train stochastically over
>>> multiple iterations.  Again,
>>> you are typically either bottlenecked by the network evaluation (for a
>>> modest to large network),
>>> or the speed at which you can read them off of disk (for a small
>>> network).
>>>
>>> DMA throughput or latency doesn't ever come into the picture because all
>>> of the work is performed
>>> by the GPU and all of the data stays on the GPU until you are ready to
>>> assign a prediction to each sample
>>> or save the updated model back to disk.
>>>
>>> A straightforward way to port the CLA would be to adopt a similar
>>> model: load the entire model into GPU memory, then run samples through
>>> it. It would be especially useful to be able to cast the bulk of the
>>> work performed by the CLA in terms of linear algebra or other
>>> well-known operations like convolution, because it is actually quite
>>> difficult (i.e. time-consuming) to write high-performance low-level
>>> code. Note that this isn't just true for GPUs; most people can't write
>>> a dense matrix library that competes with MKL on Intel CPUs either.
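As one example of recasting CLA work as linear algebra along the lines Greg suggests, the spatial pooler's overlap computation can be written as a single matrix-vector product. This is a sketch under assumed parameters (connection threshold, top-k activation), not Numenta's formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_columns, n_inputs = 2048, 400             # sizes from Ian's example

# Connected synapses form a binary matrix C (columns x input bits); the
# overlap of every column with a binary input x is then just C @ x.
permanence = rng.random((n_columns, n_inputs)).astype(np.float32)
C = (permanence >= 0.5).astype(np.float32)  # connected-synapse mask (assumed threshold)
x = rng.integers(0, 2, n_inputs).astype(np.float32)  # binary input vector

overlap = C @ x                             # one matrix-vector product per timestep
k = 40                                      # assumed number of winning columns
active = np.argsort(overlap)[-k:]           # activate the top-k columns
```

Expressed this way, the per-timestep work maps onto library GEMV/GEMM calls instead of hand-written GPU kernels.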
>>>
>>> Greg
>>>
>>>
>>>
>>>
>>> On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:
>>>
>>>> Hideaki your PDF is a good starting point for implementing a GPU
>>>> accelerated spatial pooler.
>>>>
>>>> I have pointed to this before - some work has been done assessing the
>>>> CLA (based on CLA whitepaper 1.0) and opportunities to parallelize it with
>>>> GPU.  http://pdxscholar.library.pdx.edu/open_access_etds/202/
>>>>
>>>> RE: CUDA 6, latency is still an issue. CUDA 6 makes memory management
>>>> simpler, but memory copies between system and GPU memory still need to
>>>> be done and the latencies remain. On the upside, CUDA 6 handles the
>>>> programming of data block transfers transparently for developers. In
>>>> contrast, AMD's HSA architecture implements truly unified memory access
>>>> and reduces latency. I am sure Nvidia will be close behind, and the next
>>>> iteration will have a similar unified architecture that reduces latency.
>>>> Moving memory from CPU to GPU and back kills a lot of the performance
>>>> gains when it comes to the CLA.
>>>>
>>>> Hideaki, you are correct: the synapse traversals in the CLA make it a
>>>> poor fit for current GPU architectures. You have to constantly reach out
>>>> to other memory blocks to get the data you need to perform updates. The
>>>> spatial pooler is a better fit for the current generation of GPUs, so we
>>>> could start with spatial pooler GPU acceleration and gain some benefits
>>>> right now.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki <[email protected]> wrote:
>>>>
>>>>> Hi Sergey,
>>>>>
>>>>> Thank you for the link. This is really a cool idea. Recent high-end
>>>>> GPUs have a few GB of memory on the device, even gaming cards like the
>>>>> GTX 780 Ti. It's as if main memory were an L3 or L4 cache for the
>>>>> memory on the GPU device, and the CPU a co-processor for certain
>>>>> computational tasks. ;-)
>>>>>
>>>>>
>>>>> Unfortunately, I'm not aware of any progress on NuPIC on GPU. I guess
>>>>> the difficulty comes from 1) the parallel link (= synapse) traversals
>>>>> in the CLA, and 2) the lack of a solid C code base that can be ported
>>>>> to a GPU kernel.
>>>>>
>>>>> Once nupic.core becomes ready, it may be a good time to try this
>>>>> porting out. In addition, CUDA 6 has been released recently, which
>>>>> will make programmers' lives easier. So the situation is better than
>>>>> ever, I believe... [*]
>>>>>
>>>>> A few months ago, I drew a sketch on a piece of paper about how to run
>>>>> the SP on a GPU. I picked it up and transferred my messy handwriting to
>>>>> a PDF file, in the hope that you may find it interesting... PFA.
>>>>> (note: my sketch is not a porting design but a GPU-oriented design
>>>>> unrelated to NuPIC)
>>>>>
>>>>> [*] https://developer.nvidia.com/cuda-toolkit
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> nupic mailing list
>>>>> [email protected]
>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
