It should certainly be OpenCL. GPU cards can also be cross-connected with each other, which increases the available memory and core count.

On 08.05.2014 07:28, Daniel Bell wrote:
What about OpenCL?


On Thu, May 8, 2014 at 2:46 PM, Oreste Villa <[email protected]> wrote:

    Hi Greg,

    About a year ago I looked at the CLA algorithm with the aim of
    porting it to GPUs.

    In principle the algorithm should be an "easy" port (running
    entirely on the GPU), but at that time there were many software
    engineering issues to overcome... more below.

    I would classify the CLA algorithm as more like a molecular
    dynamics simulation than anything like a sparse matrix product.
    Maybe it could be mathematically reformulated as such, but I
    don't know about that.

    It is basically an algorithm where, for each column (the
    equivalent of a cortical micro-column) and for each synapse in
    the column, you check within a certain cut-off radius for a
    binary match with the observed input (or with synapses from other
    columns). A permanence value for the synapse is incremented or
    decremented based on whether or not it overlaps with the inputs.
    Things are much more complicated in practice (there are many more
    phases); for a better explanation of the algorithm (with
    pseudo-code) you can refer to:

    http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf
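    The per-synapse check and permanence update described above can be
    sketched roughly as follows. This is a hypothetical, vectorized
    NumPy sketch, not NuPIC's actual code; the function name, array
    layout, and increment/decrement constants are all assumptions:

```python
import numpy as np

# Hypothetical sketch of the permanence update described above.
# Each synapse watches one input bit and carries a permanence in [0, 1].
PERM_INC = 0.05   # assumed increment when the synapse's input bit is active
PERM_DEC = 0.05   # assumed decrement when it is not

def update_permanences(perms, input_indices, active_input):
    """perms: (n_synapses,) float array of permanence values.
    input_indices: (n_synapses,) int array, the input bit each synapse watches.
    active_input: (n_input,) boolean array, the observed binary input."""
    overlap = active_input[input_indices]          # did each synapse match?
    perms[overlap] = np.minimum(perms[overlap] + PERM_INC, 1.0)
    perms[~overlap] = np.maximum(perms[~overlap] - PERM_DEC, 0.0)
    return perms
```

    Written this way, every synapse's update is independent, which is
    exactly the kind of loop a GPU thread block can take over.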

    The software engineering problem I had back when I looked into
    this was that the code was mostly Python, and the early C++
    implementation had many encapsulated classes and layers, like
    region->columns->cells->segments->synapses->etc.
    Each class had its own methods, attributes, and properties. This,
    for instance, made it a nightmare to copy all the permanence
    values of all the synapses into an array that all the GPU
    threads could work on.

    A natural parallelization would have been to assign different
    regions to different kernels and to parallelize the permanence
    checks and updates for spatial and temporal pooling within a
    kernel. Depending on the phase of the algorithm, a GPU thread
    could be working on a column or on a cell.

    I don't know what the current status of the C++ effort is, but my
    feeling from the early stages is that the engine needed to be
    more lightweight and reorganized around structures of arrays
    (for instance, one array holding all the permanence values, etc.).
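    The structure-of-arrays reorganization described above might look
    like the following sketch (the class and field names are
    hypothetical, not from NuPIC; the threshold is an assumed value):

```python
import numpy as np

# Instead of the nested region->column->cell->segment->synapse object
# graph, every attribute of every synapse lives in one flat contiguous
# array that can be copied to the GPU in a single transfer and scanned
# by one thread per element.
class SynapsesSoA:
    def __init__(self, n_synapses):
        self.permanence = np.zeros(n_synapses, dtype=np.float32)
        self.presyn_cell = np.zeros(n_synapses, dtype=np.int32)
        self.segment_id = np.zeros(n_synapses, dtype=np.int32)

    def connected(self, threshold=0.2):
        # One vectorized pass over all permanences -- the kind of bulk
        # operation that maps directly onto GPU threads.
        return self.permanence >= threshold
```

    The point of the layout is the single `self.permanence` array: the
    copy-everything-to-the-GPU step that was a nightmare with nested
    classes becomes one memcpy.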


    Oreste


    On Wed, May 7, 2014 at 7:35 PM, Ian Danforth
    <[email protected]> wrote:

        Greg,

        Thanks for this detailed answer! If you ever feel interested
        in porting the CLA I'd love to work with you. A typical model
        today might maintain ~409,600 permanences for the spatial
        pooler (2048 columns * 400-bit input * 0.5 potential pool),
        and the temporal pooler has potentially 65536^2 weights, but
        it starts with a tiny fraction of that and grows over time.
        That growth process is probably the trickiest to manage. That
        said, the way new connections are chosen could probably be
        done in a more GPU-friendly way if we understood the most
        efficient way to do so.
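        The model sizes quoted above work out as follows (the
        2048-columns-times-32-cells decomposition of the 65,536 cell
        count is my assumption, not stated in the thread):

```python
# Spatial pooler: 2048 columns, 400-bit input, 50% potential pool.
sp_permanences = 2048 * 400 * 0.5   # 409,600 permanences

# Temporal pooler: 65,536 cells (presumably 2048 columns * 32 cells
# per column), so a fully dense cell-to-cell weight matrix would hold
# 65536**2 entries -- ~4.3 billion potential weights, which is why the
# model must start nearly empty and grow connections sparsely.
tp_max_weights = 65536 ** 2

print(int(sp_permanences), tp_max_weights)
```

        At 4 bytes per float the dense temporal-pooler matrix alone
        would need ~16 GB, which motivates the sparse, growing
        representation Ian describes.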

        Ian




        On Wed, May 7, 2014 at 5:22 PM, Greg Diamos
        <[email protected]> wrote:

            CUDA 6 is mainly about removing the need for programmers
            to use explicit DMA copies
            to move data structures between CPU and GPU memory.  For
            applications that just need
            to copy inputs to the GPU, perform some computationally
            dense operation, then read
            a few results back, performance should be similar to DMA
            operations.

            You can typically get around 12 GB/s for DMA copies
            between the CPU and GPU for DMA sizes larger than 32 KB.
            This is potentially less bandwidth than the CPU can get
            out of DRAM (usually something like 26 GB/s or 52 GB/s),
            and the minimum efficient access size is substantially
            bigger.
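            A quick back-of-the-envelope from the bandwidths quoted
            above (the 400 MB model size is an arbitrary example, not
            a figure from the thread):

```python
# Transfer-time estimate from the quoted bandwidths.
DMA_BW = 12e9    # bytes/s, CPU<->GPU DMA for copies larger than 32 KB
DRAM_BW = 26e9   # bytes/s, low end of the quoted CPU DRAM bandwidth

model_bytes = 400e6  # hypothetical model: 100M float32 permanences

dma_time = model_bytes / DMA_BW     # ~33 ms for one full copy over DMA
dram_time = model_bytes / DRAM_BW   # ~15 ms to stream the same data from DRAM

print(round(dma_time * 1e3, 1), round(dram_time * 1e3, 1))
```

            One such copy per run is negligible; one per timestep is
            not, which is why the strategy below keeps the model
            resident on the GPU.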

            I don't know enough about CLA to say how difficult a GPU
            port would be, but I do have some
            experience with porting deep neural networks to GPUs, and
            maybe someone with more
            experience with the CLA could comment on similarities and
            differences.

            Deep neural networks are actually extremely easy to port
            to GPUs.  You model them
            as an array of layers (each represented by a dense or
            block-sparse matrix).  Each entry
            in one of these matrices is a floating point value that
            represents the weight for a neuron
            connection.  You load these matrices directly into GPU
            memory when the program starts, and doing this is just
            about as fast as loading them into CPU memory (since you
            are probably loading from disk, and the CPU->GPU link has
            about 10-50x higher bandwidth than a disk).
            You then load a batch of samples into GPU memory, and
            propagate them back and forth
            through the network.  The bulk of the work to propagate a
            set of samples through one layer
            in the network is performed using a dense matrix-matrix
            multiply operation for a fully-connected
            layer, or a batched (parallel) set of smaller
            matrix-matrix multiplies for a
            locally-connected/convolutional layer.  This operation is
            both extremely compute-intensive,
            and extremely parallel.  It is straightforward to achieve
            high percentages of peak throughput,
            and there is no need to write any GPU code.  You can just
            call into a matrix library, or an image
            processing library when the tiles get small enough (and
            start using shared weights) that they
            start looking more like convolutions than general matrix
            multiplies.
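            Propagating a batch through one fully-connected layer, as
            described above, reduces to a single dense matrix-matrix
            multiply. A minimal NumPy sketch (on a GPU the same call
            would dispatch to a matrix library such as cuBLAS; the
            ReLU activation is my assumption, as the thread does not
            name one):

```python
import numpy as np

def forward_layer(X, W, b):
    """X: (batch, n_in) samples; W: (n_in, n_out) weights; b: (n_out,) bias.
    The whole batch advances one layer with a single dense GEMM --
    the compute-dense, highly parallel operation described above."""
    return np.maximum(X @ W + b, 0.0)   # affine transform + ReLU

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 400)).astype(np.float32)   # batch of samples
W = rng.standard_normal((400, 256)).astype(np.float32)  # layer weights
out = forward_layer(X, W, np.zeros(256, dtype=np.float32))
```

            The GEMM does O(batch * n_in * n_out) arithmetic on
            O(batch * n_in + n_in * n_out) data, which is the
            compute-to-memory ratio that lets GPUs reach a high
            fraction of peak throughput without any custom kernels.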

            GPU memories are in the range of 1-12GB, so you can model
            a network with
            something like a billion connections on a single GPU.  If
            you have more samples than will
            fit into GPU memory, then you batch them and train
            stochastically over multiple iterations.  Again,
            you are typically either bottlenecked by the network
            evaluation (for a modest to large network),
            or the speed at which you can read them off of disk (for a
            small network).

            DMA throughput or latency doesn't ever come into the
            picture because all of the work is performed
            by the GPU and all of the data stays on the GPU until you
            are ready to assign a prediction to each sample
            or save the updated model back to disk.

            A straightforward way to port the CLA would be to adopt a
            similar model.  Load the entire model into GPU memory,
            then run samples through it.  It would be especially
            useful to be able to cast the bulk of the work performed by
            CLA in terms of linear algebra or other well known
            operations like convolution, because it is actually quite
            difficult (i.e. time consuming) to write high-performance
            low-level code.  Note that this isn't just true for GPUs;
            most people can't write a dense matrix library that
            competes with MKL on Intel CPUs either.

            Greg




            On Wed, May 7, 2014 at 2:54 PM, Doug King
            <[email protected]> wrote:

                Hideaki, your PDF is a good starting point for
                implementing a GPU-accelerated spatial pooler.

                I have pointed to this before - some work has been
                done assessing the CLA (based on CLA whitepaper 1.0)
                and opportunities to parallelize it with GPU.
                http://pdxscholar.library.pdx.edu/open_access_etds/202/

                RE: CUDA 6, latency is still an issue. CUDA 6 makes
                memory management simpler, but memory copies between
                system and GPU memory still need to be done, and the
                latencies remain. On the upside, CUDA 6 handles the
                programming of data-block transfers transparently for
                developers. In contrast, AMD's HSA architecture
                implements truly unified memory access and reduces
                latency. I am sure Nvidia will be close behind, and
                the next iteration will have a similar unified
                architecture that reduces latency. Moving memory from
                CPU to GPU and back kills a lot of the performance
                gains when it comes to the CLA.

                Hideaki, you are correct: the synapse traversals in
                the CLA make it a poor fit for current GPU
                architectures. You have to constantly reach out to
                other memory blocks to get the data you need to
                perform updates. The spatial pooler is a better fit
                for the current generation of GPUs, and we could
                start with spatial pooler GPU acceleration and gain
                some benefits right now.
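                The reason the spatial pooler fits better is that its
                overlap phase is one regular matrix-vector product per
                timestep, with no pointer chasing. A sketch with
                hypothetical names (NumPy; on a GPU this would be one
                kernel or library call):

```python
import numpy as np

def sp_overlap(connected, active_input):
    """connected: (n_columns, n_input) boolean connectivity matrix.
    active_input: (n_input,) boolean input vector.
    Each column's overlap (count of its connected synapses whose input
    bit is active) is independent of every other column's, so one GPU
    thread or warp per column parallelizes trivially."""
    return connected.astype(np.int32) @ active_input.astype(np.int32)
```

                Contrast this with the temporal pooler's traversal of
                segments and synapses scattered across memory, which
                is exactly the irregular access pattern GPUs handle
                poorly.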




                On Wed, May 7, 2014 at 1:10 PM, Hideaki Suzuki
                <[email protected]> wrote:

                    Hi Sergey,

                    Thank you for the link.  This is a really cool
                    idea.  Recent high-end GPUs have a few GB of
                    memory on the device, even gaming cards like the
                    GTX 780 Ti.  It looks like main memory would act
                    as an L3 or L4 cache for the memory on the GPU
                    device, and the CPU is like a co-processor for
                    certain kinds of computational tasks. ;-)


                    Unfortunately, I'm not aware of any progress on
                    NuPIC on GPUs.  I guess the difficulty comes from
                    1) the parallel link (= synapse) traversals in
                    the CLA and 2) the lack of a solid C code base
                    that could be ported to a GPU kernel.

                    Once nupic.core becomes ready, it may be a good
                    time to try this porting out. In addition, CUDA 6
                    has been released recently, which will make
                    programmers' lives easier.  So the situation is
                    better than ever, I believe... [*]

                    A few months ago, I drew a sketch on a piece of
                    paper about how to run the SP on a GPU.  I picked
                    it up and transferred my rough handwriting to a
                    PDF file, in the hope that you may find it
                    interesting... PFA.
                    (Note: my sketch is not a porting design but a
                    GPU-oriented design unrelated to NuPIC.)

                    [*] https://developer.nvidia.com/cuda-toolkit


                    _______________________________________________
                    nupic mailing list
                    [email protected]
                    http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
















