CUDA 6 is mainly about removing the need for programmers to issue explicit
DMA copies to move data structures between CPU and GPU memory.  For
applications that just need to copy inputs to the GPU, perform some
computationally dense operation, then read a few results back, performance
should be similar to what you get with explicit DMA copies.

You can typically get around 12 GB/s for DMA copies between the CPU and GPU
for transfer sizes larger than 32 KB.  This is potentially less bandwidth
than the CPU can get out of DRAM (usually something like 26 GB/s or
52 GB/s), and the access size needed to reach that bandwidth is
substantially bigger.
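A quick back-of-envelope calculation shows when those copy costs stop
mattering.  The bandwidth and FLOP figures below are illustrative
assumptions (roughly the numbers above plus a mid-range GPU), not
measurements:

```python
# Rough cost model: PCIe transfer time vs. GEMM compute time.
# PCIE_BW and GPU_FLOPS are illustrative assumptions, not measurements.
PCIE_BW = 12e9     # assumed CPU<->GPU DMA bandwidth, bytes/s
GPU_FLOPS = 4e12   # assumed GPU peak throughput, FLOP/s

def transfer_seconds(nbytes):
    """Time to DMA nbytes across the CPU<->GPU link."""
    return nbytes / PCIE_BW

def gemm_seconds(n):
    """Time for a dense n x n matrix multiply (~2*n^3 FLOPs)."""
    return 2 * n**3 / GPU_FLOPS

n = 4096
nbytes = 3 * n * n * 4          # two float32 inputs in, one result out
print(f"copy:    {transfer_seconds(nbytes) * 1e3:.1f} ms")
print(f"compute: {gemm_seconds(n) * 1e3:.1f} ms")
```

Under these assumptions the compute time for a 4096x4096 multiply already
exceeds the transfer time, and the gap widens as matrices grow: compute
cost scales cubically with n while transfer cost scales only
quadratically.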

I don't know enough about the CLA to say how difficult a GPU port would be,
but I do have some experience porting deep neural networks to GPUs; maybe
someone with more CLA experience could comment on the similarities and
differences.

Deep neural networks are actually extremely easy to port to GPUs.  You
model them as an array of layers, each represented by a dense or
block-sparse matrix.  Each entry in one of these matrices is a
floating-point value that represents the weight of a neuron connection.
You load these matrices directly into GPU memory when the program starts,
and doing this is just about as fast as loading them into CPU memory (since
you are probably loading them from disk, and the CPU->GPU link has about
10-50x higher bandwidth than a disk).  You then load a batch of samples
into GPU memory and propagate them forward and backward through the
network.  The bulk of the work to propagate a set of samples through one
layer of the network is a dense matrix-matrix multiply for a
fully-connected layer, or a batched (parallel) set of smaller matrix-matrix
multiplies for a locally-connected/convolutional layer.  This operation is
both extremely compute-intensive and extremely parallel.  It is
straightforward to achieve a high percentage of peak throughput, and there
is no need to write any GPU code: you can just call into a matrix library,
or into an image processing library once the tiles get small enough (and
start using shared weights) that they look more like convolutions than
general matrix multiplies.
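As a concrete sketch of the fully-connected case, the per-layer work really
is one dense matrix-matrix multiply over the whole batch.  The shapes and
the ReLU activation below are illustrative choices of mine; on a GPU the
same multiply would be a single call into a matrix library such as cuBLAS:

```python
import numpy as np

def fc_forward(batch, weights, bias):
    """Propagate a batch of samples through one fully-connected layer.

    batch:   (n_samples, n_in)  activations from the previous layer
    weights: (n_in, n_out)      dense connection-weight matrix
    bias:    (n_out,)
    The bulk of the work is the single dense matrix-matrix multiply.
    """
    z = batch @ weights + bias
    return np.maximum(z, 0.0)   # ReLU nonlinearity (illustrative choice)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 1024)).astype(np.float32)    # 64 samples
w = (0.01 * rng.standard_normal((1024, 512))).astype(np.float32)
b = np.zeros(512, dtype=np.float32)
print(fc_forward(x, w, b).shape)   # one layer's output: (64, 512)
```

A convolutional layer replaces this single large multiply with a batched
set of smaller ones, one per tile, but the structure is the same.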

GPU memories are in the range of 1-12 GB, so you can model a network with
something like a billion connections on a single GPU.  If you have more
samples than will fit into GPU memory, you batch them and train
stochastically over multiple iterations.  Again, you are typically
bottlenecked either by network evaluation (for a modest to large network)
or by the speed at which you can read samples off of disk (for a small
network).
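The batching described above is just a loop over GPU-memory-sized slices of
the sample set.  In the sketch below, `update_fn` is a hypothetical
placeholder standing in for one forward/backward pass plus a stochastic
weight update; only the batching structure is taken from the text:

```python
def train(samples, labels, batch_size, n_epochs, update_fn):
    """Stream samples through the model in GPU-memory-sized batches.

    update_fn(batch_x, batch_y) is a placeholder for one
    forward/backward pass plus a stochastic weight update on the
    GPU-resident model.  Returns the total number of updates made.
    """
    n_updates = 0
    for _ in range(n_epochs):
        for start in range(0, len(samples), batch_size):
            update_fn(samples[start:start + batch_size],
                      labels[start:start + batch_size])
            n_updates += 1
    return n_updates

# 10 samples, batches of 4, 2 epochs -> 3 batches per epoch, 6 updates
seen = []
total = train(list(range(10)), list(range(10)), 4, 2,
              lambda x, y: seen.append(len(x)))
print(total, seen)   # 6 [4, 4, 2, 4, 4, 2]
```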

DMA throughput and latency never come into the picture, because all of the
work is performed by the GPU, and all of the data stays on the GPU until
you are ready to assign a prediction to each sample or save the updated
model back to disk.

A straightforward way to port the CLA would be to adopt a similar model:
load the entire model into GPU memory, then run samples through it.  It
would be especially useful to be able to cast the bulk of the work
performed by the CLA in terms of linear algebra or other well-known
operations like convolution, because it is actually quite difficult (i.e.
time-consuming) to write high-performance low-level code.  Note that this
isn't just true for GPUs; most people can't write a dense matrix library
that competes with MKL on Intel CPUs either.

Greg




On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:

> Hideaki your PDF is a good starting point for implementing a GPU
> accelerated spatial pooler.
>
> I have pointed to this before - some work has been done assessing the CLA
> (based on CLA whitepaper 1.0) and opportunities to parallelize it with
> GPU.  http://pdxscholar.library.pdx.edu/open_access_etds/202/
>
> RE: Cuda 6, Latency is still an issue. Cuda 6 makes memory management
> simpler. But memory copies between the system and GPU memory still need to
> be done and the latencies remain. On the upside CUDA 6 handles the
> programming of data block transfers for developers transparently. In
> contrast, AMD's HSA architecture implements truly unified memory access and
> reduces latency. I am sure Nvidia will be close behind and next iteration
> will have a similar unified architecture that reduces latency. Moving
> memory from CPU to GPU and back kills a lot of the performance gains when
> it comes to the CLA.
>
> Hideaki, you are correct, the synapse traversals in the CLA make it a
> poor fit for current GPU architecture. You have to constantly reach out to
> other memory blocks to get the data you need to perform updates.  Spatial
> pooler is a better fit for the current generation of GPUs, and we could start
> with spatial pooler GPU acceleration and gain some benefits right now.
>
>
>
>
> On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki <[email protected]> wrote:
>
>> Hi Sergey,
>>
>> Thank you for the link.  This is really a cool idea.  Recent high-end
>> GPUs have a few GB of memory on the device, even gaming cards like the
>> GTX 780 Ti.  It looks like main memory would act as an L3 or L4 cache for
>> the memory on the GPU device, and the CPU would be like a co-processor
>> for some kinds of computational tasks. ;-)
>>
>>
>> Unfortunately, I'm not aware of any progress on Nupic on GPU.  I guess
>> the difficulty comes from 1) the parallel link (= synapse) traversals in
>> the CLA and 2) the lack of a solid C code base that could be ported to
>> GPU kernels.
>>
>> Once nupic.core becomes ready, it may be a good time to try this port.
>> In addition, CUDA 6 was released recently, which will make programmers'
>> lives easier.  So the situation is better than ever, I believe... [*]
>>
>> A few months ago, I drew a sketch on a piece of paper about how to run
>> the SP on a GPU.  I picked it up and transferred my messy handwriting to
>> a PDF file, in the hope that you may find it interesting...  PFA.
>> (note: my sketch is not a porting design but a GPU-oriented design
>> unrelated to Nupic)
>>
>> [*] https://developer.nvidia.com/cuda-toolkit
>>
>>
>> _______________________________________________
>> nupic mailing list
>> [email protected]
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>
>>
>