I think it should definitely be OpenCL. GPU cards can also be cross-connected
to each other, which increases the available memory and core count.
On 08.05.2014 07:28, Daniel Bell wrote:
What about OpenCL?
On Thu, May 8, 2014 at 2:46 PM, Oreste Villa <[email protected]> wrote:
Hi Greg,
About a year ago I looked at the CLA algorithm with the aim
of porting it to GPUs.
In principle the algorithm should be an "easy" port (running
entirely on the GPU), but at that time there were many SW
engineering issues to overcome... more below.
I would classify the CLA algorithm as more like a molecular dynamics
simulation than anything like a sparse matrix product. Maybe it
could be mathematically reformulated as such, but I don't know
about that.
It is basically an algorithm where, for each column (the equivalent of
a cortical micro-column) and for each synapse in the column, you
check within a certain cut-off radius for a binary match with the
observed input (or with synapses from other columns). A permanence
value for the synapse is incremented or decremented based on whether
or not it overlaps with the inputs. Things are much more
complicated in practice (there are many more phases); for a better
explanation of the algorithm (with pseudo-code) you can refer to:
http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf
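As a rough illustration of the per-column inner loop described above, here is a minimal Python sketch. All names and constants (CONNECTED, INC, DEC) are my own illustrative assumptions, not NuPIC's actual values:

```python
# Minimal sketch of the per-column permanence update described above.
# The threshold and increment values are illustrative assumptions.
CONNECTED = 0.5   # permanence threshold for a synapse to count as connected
INC, DEC = 0.05, 0.05

def update_column(permanences, input_indices, active_input):
    """permanences[i] is the permanence of the column's i-th potential
    synapse; input_indices[i] is the input bit it watches."""
    for i, bit in enumerate(input_indices):
        if active_input[bit]:                 # binary match with the input
            permanences[i] = min(1.0, permanences[i] + INC)
        else:
            permanences[i] = max(0.0, permanences[i] - DEC)
    # overlap = number of connected synapses that see an active bit
    return sum(1 for i, bit in enumerate(input_indices)
               if permanences[i] >= CONNECTED and active_input[bit])
```

This only captures the permanence-update phase; the real algorithm adds inhibition, boosting, and the temporal phases on top.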
The SW engineering problem I had back when I looked into this was
that the code was mostly Python, and the early C++ implementation
used many encapsulated classes and layers, like
region->columns->cell->segments->synapses->etc.
Each class had its own methods, attributes, and properties. This,
for instance, made it a nightmare to copy all the
permanence values of all the synapses into an array that all the GPU
threads could work on.
A natural parallelization would have been to assign different
regions to different kernels and to parallelize the permanence
checks and updates for spatial and temporal pooling within a kernel.
Depending on the phase of the algorithm, a GPU thread could be
working on a column or on a cell.
I don't know the current status of the C++ effort, but my
feeling from the early stages is that the engine needed
to be more lightweight and reorganized to use structures of
arrays (for instance, an array of all the permanence values, etc.)
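To make the structure-of-arrays point concrete, here is a sketch (class and variable names are my own, not NuPIC's): instead of permanences scattered across nested objects, they live in one flat, contiguous array that can be copied to the device in a single bulk transfer and updated uniformly.

```python
# Object style (array of structures): permanence values scattered
# across many small objects, painful to bulk-copy to a GPU.
class Synapse:
    def __init__(self, permanence):
        self.permanence = permanence

synapses = [Synapse(p) for p in (0.25, 0.5, 0.75, 1.0)]

# Structure of arrays: gather everything into one flat list once.
permanences = [s.permanence for s in synapses]

# Now a GPU-style bulk update is one uniform pass over the array.
permanences = [min(1.0, p + 0.25) for p in permanences]
```

On a real GPU the flat array maps directly to a device buffer and the update to one kernel launch; with the object layout you would have to traverse the object graph on every transfer.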
Oreste
On Wed, May 7, 2014 at 7:35 PM, Ian Danforth
<[email protected] <mailto:[email protected]>> wrote:
Greg,
Thanks for this detailed answer! If you ever feel interested
in porting the CLA I'd love to work with you. A typical model
today might maintain ~409,600 permanences for the spatial
pooler (2048 columns * 400-bit input * 0.5 potential pool), and
then the temporal pooler has potentially 65536^2 weights, but it
starts with a tiny fraction of that and grows over time.
That growth process is probably the trickiest to manage.
That said, the way new connections are chosen could probably
be done in a more GPU-friendly way if we understood the most
efficient way to do so.
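For reference, the model sizes quoted above work out as follows (a small check of the arithmetic; the variable names are mine):

```python
# Spatial pooler: one permanence per potential synapse.
columns, input_bits, potential_frac = 2048, 400, 0.5
sp_permanences = int(columns * input_bits * potential_frac)

# Temporal pooler: worst case, every cell can connect to every cell.
cells = 65536
tp_max_weights = cells ** 2   # grown sparsely in practice
```

The gap between the dense worst case (~4.3 billion weights) and the tiny initial fraction is exactly why the growth process is the hard part to manage on a GPU, where allocation is expensive.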
Ian
On Wed, May 7, 2014 at 5:22 PM, Greg Diamos <[email protected]> wrote:
CUDA 6 is mainly about removing the need for programmers
to use explicit DMA copies
to move data structures between CPU and GPU memory. For
applications that just need
to copy inputs to the GPU, perform some computationally
dense operation, then read
a few results back, performance should be similar to DMA
operations.
You typically can get around 12GB/s for DMA copies between
the CPU and GPU for
DMA sizes larger than 32KB. This is potentially less
bandwidth than the CPU can get
out of DRAM (usually something like 26GB/s or 52 GB/s),
and the access size is
substantially bigger.
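To put those bandwidths in perspective, a back-of-envelope calculation (the 1 GB model size is my own illustrative assumption):

```python
# Transfer times for the bandwidths quoted above.
model_bytes = 1 * 1024**3     # assume a 1 GB model

dma_bw  = 12e9                # B/s, CPU<->GPU DMA
dram_bw = 26e9                # B/s, CPU out of DRAM (low end)

dma_ms  = model_bytes / dma_bw  * 1e3   # ~90 ms over the DMA link
dram_ms = model_bytes / dram_bw * 1e3   # ~40 ms out of DRAM
```

So even a full model copy over the DMA link costs well under a second, which is why it only matters if you do it per-sample rather than once at startup.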
I don't know enough about CLA to say how difficult a GPU
port would be, but I do have some
experience with porting deep neural networks to GPUs, and
maybe someone with more
experience with the CLA could comment on similarities and
differences.
Deep neural networks are actually extremely easy to port
to GPUs. You model them
as an array of layers (each represented by a dense or
block-sparse matrix). Each entry
in one of these matrices is a floating point value that
represents the weight for a neuron
connection. You load these matrices directly into GPU
memory when the program starts,
and doing this is just about as fast as loading them into
CPU memory (since you are probably
loading from disk, and the CPU->GPU link has about 10-50x
higher bandwidth than a disk).
You then load a batch of samples into GPU memory, and
propagate them back and forth
through the network. The bulk of the work to propagate a
set of samples through one layer
in the network is performed using a dense matrix-matrix
multiply operation for a fully-connected
layer, or a batched (parallel) set of smaller
matrix-matrix multiplies for a
locally-connected/convolutional layer. This operation is
both extremely compute-intensive,
and extremely parallel. It is straightforward to achieve
high percentages of peak throughput,
and there is no need to write any GPU code. You can just
call into a matrix library, or an image
processing library when the tiles get small enough (and
start using shared weights) that they
start looking more like convolutions than general matrix
multiplies.
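The fully-connected forward step described above reduces to one matrix-matrix multiply per layer. A pure-Python sketch of that structure (in a real port each matmul would be a single cuBLAS-style library call; all names here are illustrative):

```python
# One fully-connected layer is: activations (batch x n_in) times
# weights (n_in x n_out) -- the dense matmul the GPU excels at.
def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def forward(batch, layers):
    """Propagate a batch of samples through a list of weight matrices."""
    x = batch
    for w in layers:
        x = matmul(x, w)                               # one gemm per layer
        x = [[max(0.0, v) for v in row] for row in x]  # ReLU nonlinearity
    return x
```

The point of the sketch is that the whole forward pass is a handful of large, regular operations, so there is no GPU kernel to write by hand: a matrix library does the heavy lifting.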
GPU memories are in the range of 1-12GB, so you can model
a network with
something like a billion connections on a single GPU. If
you have more samples than will
fit into GPU memory, then you batch them and train
stochastically over multiple iterations. Again,
you are typically either bottlenecked by the network
evaluation (for a modest to large network),
or the speed at which you can read them off of disk (for a
small network).
DMA throughput or latency doesn't ever come into the
picture because all of the work is performed
by the GPU and all of the data stays on the GPU until you
are ready to assign a prediction to each sample
or save the updated model back to disk.
A straightforward way to port the CLA would be to adopt a
similar model. Load the entire model into GPU memory,
then run samples through it. It would be especially
useful to be able to cast the bulk of the work performed by
CLA in terms of linear algebra or other well known
operations like convolution because it is actually quite
difficult (i.e. time consuming) to write high performance
low-level code. Note that this isn't just
true for GPUs; most people can't write a dense matrix
library that competes with MKL on Intel CPUs either.
Greg
On Wed, May 7, 2014 at 2:54 PM, Doug King <[email protected]> wrote:
Hideaki, your PDF is a good starting point for
implementing a GPU-accelerated spatial pooler.
I have pointed to this before: some work has been
done assessing the CLA (based on the CLA whitepaper 1.0)
and the opportunities to parallelize it on GPUs.
http://pdxscholar.library.pdx.edu/open_access_etds/202/
RE: CUDA 6, latency is still an issue. CUDA 6 makes
memory management simpler, but memory copies between
system and GPU memory still need to be done, and
the latencies remain. On the upside, CUDA 6 handles the
programming of data block transfers transparently for
developers. In contrast, AMD's HSA architecture
implements truly unified memory access and reduces
latency. I am sure Nvidia will be close behind, and
its next iteration will have a similar
unified architecture that reduces latency. Moving
memory from CPU to GPU and back kills a lot of the
performance gains when it comes to the CLA.
Hideaki, you are correct: the synapse traversals in
the CLA make it a poor fit for current GPU
architectures. You have to constantly reach out to
other memory blocks to get the data you need to
perform updates. The spatial pooler is a better fit for
the current generation of GPUs, so we could start with
spatial pooler GPU acceleration and gain some benefits
right now.
On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki
<[email protected] <mailto:[email protected]>> wrote:
Hi Sergey,
Thank you for the link. This is a really cool
idea. Recent high-end GPUs have a few GB of
memory on the device, even gaming cards like the GTX 780
Ti. It looks like main memory would act as an L3 or
L4 cache for the GPU device memory, and the CPU
would be like a co-processor for some kinds of
computational tasks. ;-)
Unfortunately, I'm not aware of any progress on
NuPIC on GPUs. I guess the difficulty comes from
1) the parallel link (= synapse) traversals in the CLA
and 2) the lack of a solid C code base that could be
ported to a GPU kernel.
Once nupic.core becomes ready, it may be a good
time to try this porting out. In addition, CUDA 6
has been released recently, which will make
programmers' lives easier. So the situation is
better than ever, I believe... [*]
A few months ago, I drew a sketch on a piece of
paper about how to run the SP on a GPU. I picked it up
and transferred my messy handwriting to a PDF
file, in the hope that you may find it interesting...
PFA.
(Note: my sketch is not a porting design but
a GPU-oriented design unrelated to NuPIC.)
[*] https://developer.nvidia.com/cuda-toolkit
_______________________________________________
nupic mailing list
[email protected]
<mailto:[email protected]>
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org