Re: [nupic-dev] NuPIC performance requirements

Marek Otahal Thu, 22 Aug 2013 03:21:27 -0700

Subutai, thanks for the valuable info!

To all, thanks for all the ideas submitted. I'll add to the discussion a
JIRA issue https://issues.numenta.org/browse/NPC-320
about taking advantage of specialized math optimization libraries.


I definitely support the multi-core/parallel architecture for nupic,
https://issues.numenta.org/browse/NPC-259 , and I think this could (at
least to some extent quite easily) be achieved.
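
A minimal sketch of what I mean by splitting the work per column, not NuPIC code: the names `overlap_for_chunk` and `parallel_overlaps` are made up for illustration, and columns are assumed to be plain Python sets of connected input indices. It only shows the partitioning idea; because of the Python GIL the threads here will not give a real speedup, which would need C++ threads or OpenMP.

```python
from concurrent.futures import ThreadPoolExecutor


def overlap_for_chunk(columns, active_inputs):
    """For each column in the chunk, count how many connected inputs are ON."""
    return [len(connected & active_inputs) for connected in columns]


def parallel_overlaps(columns, active_inputs, n_workers=4):
    """Split the columns into contiguous chunks and score each chunk in a worker."""
    # Ceiling division so every column lands in exactly one chunk.
    chunk_size = max(1, (len(columns) + n_workers - 1) // n_workers)
    chunks = [columns[i:i + chunk_size] for i in range(0, len(columns), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda chunk: overlap_for_chunk(chunk, active_inputs), chunks))
    # pool.map preserves chunk order, so concatenation restores column order.
    return [score for chunk_scores in results for score in chunk_scores]


if __name__ == "__main__":
    # Hypothetical toy data: 4 columns, each a set of connected input bits.
    columns = [{0, 1, 2}, {2, 3}, {4}, {0, 4, 5}]
    active = {0, 2, 4}
    print(parallel_overlaps(columns, active, n_workers=2))
```

Each chunk only reads the shared input and writes its own scores, so no locking is needed for this step.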

I would be interested in the GPGPU support,
https://issues.numenta.org/browse/NPC-258 , but there was a discussion about
whether it's reasonable for this kind of task (time to transfer data GPU<->CPU).
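
A rough way to reason about the transfer question, as a back-of-the-envelope sketch only: offloading a step pays off when the GPU compute time plus the round-trip copy is shorter than the CPU step. The function names and all parameters below are hypothetical, and real bandwidth and latency would have to be measured on the actual hardware.

```python
def transfer_time_s(n_bytes, bandwidth_bytes_per_s, latency_s):
    """Simple linear cost model for one host<->device copy: fixed latency plus size/bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def offload_pays_off(cpu_step_s, gpu_step_s, bytes_in, bytes_out,
                     bandwidth_bytes_per_s, latency_s):
    """True if one GPU step, including copying input in and predictions out, beats the CPU step."""
    round_trip = (transfer_time_s(bytes_in, bandwidth_bytes_per_s, latency_s)
                  + transfer_time_s(bytes_out, bandwidth_bytes_per_s, latency_s))
    return gpu_step_s + round_trip < cpu_step_s


if __name__ == "__main__":
    # All numbers are placeholders for illustration, not measurements.
    print(offload_pays_off(cpu_step_s=20e-3, gpu_step_s=1e-3,
                           bytes_in=1000, bytes_out=1000,
                           bandwidth_bytes_per_s=1e9, latency_s=10e-6))
```

With small per-step inputs the fixed latency term dominates the copy cost, so the answer depends mostly on how much faster the GPU step really is.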

I hope to put some time, or money, into the Eigen optimizations, so I would
like to know if that's good/ok/orthogonal for the GPGPU task.

Best regards, Mark


On Thu, Aug 22, 2013 at 6:21 AM, Subutai Ahmad <[email protected]> wrote:

>
> Oreste and Doug - thank you for the comments! It is great to see these
> ideas. I agree that performance optimization is extremely important. I want
> to encourage you guys to continue the discussion and hopefully implement
> something too. A speed improvement will help the community and accelerate
> CLA based experimentation. (One of the main reasons we stopped focusing on
> vision was the inability to run large experiments in reasonable time.)
>
> I don't know if this is useful, but here's a quick guide to the current
> optimized code:
>
> Currently within NuPIC the CLA is implemented as a combination of Python
> and C++. Over time we moved the slower portions of CLA to C++. Today, the
> code is reasonably fast. A CLA region with 2048 columns and 65536 cells
> takes about 20 msecs per iteration on my laptop. This includes SP, TP, the
> classifier and full online learning turned on.
>
> There are two main building blocks in C++ from a performance standpoint.
>
> 1) There is a set of SparseMatrix classes that implement fast data
> structures for sparse vectors and matrices. They include a large number of
> utility routines that are optimized for CLA functions. Today these are
> primarily used by the Spatial Pooler. There are Python bindings for these
> classes. You can go to
> nupic/examples/bindings/sparse_matrix_how_to.py
> <https://github.com/numenta/nupic/blob/master/examples/bindings/sparse_matrix_how_to.py>
> for a tutorial on this.
>
> 2) There is a set of classes that implement fast data structures for the
> temporal pooler. These are very specific to temporal pooling. Sparse
> matrices were not enough for the TP. The input is very sparse and we
> implemented some strategies for evaluating only cells that are connected to
> those ON input bits. The main starting point for this code is
> nupic/nta/algorithms/Cells4.hpp
> <https://github.com/numenta/nupic/blob/master/nta/algorithms/Cells4.hpp>.
> There are Python bindings for this as well, and they are called by the
> Python temporal pooler class TP10X2.py
> <https://github.com/numenta/nupic/blob/master/py/nupic/research/TP10X2.py>.
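
The idea in point 2 of visiting only the cells connected to ON input bits can be illustrated with an inverted index from input bit to cells. This is a simplified sketch under my own assumptions, not how Cells4 is actually implemented; `build_input_index` and `active_cell_counts` are hypothetical names.

```python
from collections import defaultdict


def build_input_index(cell_inputs):
    """Invert a {cell: [input bits]} mapping into {input bit: [cells]}."""
    index = defaultdict(list)
    for cell, inputs in cell_inputs.items():
        for bit in inputs:
            index[bit].append(cell)
    return index


def active_cell_counts(index, on_bits):
    """Count ON connections per cell, touching only cells reachable from ON bits."""
    counts = defaultdict(int)
    for bit in on_bits:
        # .get avoids creating empty entries for bits no cell is connected to.
        for cell in index.get(bit, ()):
            counts[cell] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical toy connectivity: three cells, a handful of input bits.
    index = build_input_index({"a": [0, 3], "b": [3, 7], "c": [5]})
    print(active_cell_counts(index, on_bits=[0, 3, 9]))
```

The work per step scales with the number of ON bits and their fan-out rather than with the total number of cells, which is the point when the input is very sparse.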
>
> Our current plan is to create a pure C++ spatial pooler implementation (see
> Gil's email). This will also be much cleaner than the current
> implementation. Perhaps it can serve as a base for some of the ideas that
> Oreste and Doug have mentioned.
>
> We have not extensively explored multi-threaded options. We mainly focused
> on serial optimizations so far.
>
> Hope this helps!
>
> --Subutai
>
>
>
> On Wed, Aug 21, 2013 at 2:02 PM, Oreste Villa <[email protected]> wrote:
>
>> Thanks for the pointers, I can start looking into the code and the
>> tickets. Unfortunately I have limited time and I will be looking into it as
>> a "hobby" project for the moment.
>>
>> I think there is still some time before moving to opencl (or other). As
>> you said, the first thing in the list is to have a very good c++
>> parallelizable cpu implementation. I would say the best thing is most
>> likely openmp; also gcc4.9 should in the very near future support openacc,
>> which is basically the equivalent of openmp but for gpus. So maybe we can
>> skip opencl completely. BTW I am not a big fan of opencl (because I think
>> it is very verbose and tedious to use) and I have worked a lot with CUDA and
>> MPI. I agree opencl is more portable than CUDA but openacc targets both
>> and should solve this issue.
>>
>> Regarding unified address space: from the programming model point of view
>> it is going to come very soon, but hardware support with indistinguishable
>> performance should come in 2 generations (crossing fingers).
>>
>> Oreste
>> On Aug 21, 2013 10:32 AM, "Doug King" <[email protected]> wrote:
>>
>>> Hi Oreste,
>>>
>>> Good points and observations. My comments below.
>>>
>>> *1) Make sure all the SP and TP (basically the code with a lot of
>>> nested loops) is at least C++. I would also make sure I am not abusing the
>>> STL or Boost libraries too much (for instance in the list of
>>> active columns, etc.), as this is usually tricky (or low performance) when
>>> ported to GPUs.*
>>>
>>> We are doing some of this in the current codebase. See Jira ticket
>>> https://issues.numenta.org/browse/NPC-286  - porting entire CLA to c++
>>> with language bindings for other languages. First steps are to migrate
>>> spatial pooler to c++. To see progress on this check here:
>>> https://issues.numenta.org/browse/NPC-246  I am not sure of the
>>> approach they are taking regarding STL and Boost library.
>>>
>>> *2) I would convert the above nested loops (for all columns, for all
>>> cells, etc) to OpenCL. We have to be careful about data movement today and
>>> try to keep most of the data in GPU space (although this problem is going
>>> to disappear with next or next-next generation of GPUs which are going to
>>> have unified address space). *
>>>
>>> Yes, you are right about data movement. I don't know much about how
>>> OpenCL abstracts storage and message passing to nodes. Data movement could
>>> kill any gains you get from parallel computing if not handled correctly -
>>> biggest issue is, how to move input data onto the GPU and extract
>>> prediction back out. I/O for each time frame is the problem. First steps I
>>> think would be to convert those areas to CPU manycore parallel code in C++
>>> to sort out any issues, then OpenCL.
>>>
>>> Interesting about next-gen GPUs and unified address space - how close
>>> are we to getting unified memory space GPU?
>>>
>>> *3) Support for multiple GPUs within a single node. I would map
>>> different regions on different host-threads and GPUs and for large regions
>>> I would try to partition them across multiple host-threads and GPUs.
>>> Considering that they are 2D regions and communication is mostly localized
>>> around the columns it should be doable. However, the boundaries of the
>>> partitions are going to be tricky as updates to cells or columns will depend
>>> on values that are in another GPU address space (again next-next generation
>>> of accelerators should solve this problem with unified address space).*
>>>
>>> Perhaps we should be looking at cheap multi-core CPUs for now to address
>>> this, or would we need to port the entire CLA into OpenCL to run on shared
>>> memory in the GPU? I don't know enough about the architecture of current
>>> code, OpenCL and GPU memory space to understand the right approach, but we
>>> should take logical steps that we can build on to get there.
>>>
>>> *4) To parallelize across multiple nodes in a cluster I would
>>> definitely go for MPI and not map-reduce. The reason is that map-reduce
>>> is used mostly on embarrassingly parallel jobs with a final reduce phase to
>>> compute the final result. In our case, considering that the "computation" is
>>> based around the concept of a step (or clock), at every step there is going
>>> to be a significant amount of communication across regions and within the
>>> region. Thankfully if we stick to the concept of time step (not exactly
>>> brain like) we can batch that communication and perform it at the end of
>>> each step. If using MPI, I would map different regions on different MPI
>>> ranks and for large regions I would partition them on multiple MPI ranks as
>>> explained for point 3 within the node (basically a hierarchy of
>>> parallelization).*
>>>
>>> Agreed - Map-Reduce is not the right paradigm for the reasons you state.
>>> For moving lots of data across multiple compute nodes MPI is the standard
>>> and might be appropriate to adopt here I think.
>>>
>>> When we start talking about hierarchy though, I think operation would be
>>> similar to map/reduce with lower regions computing on nodes that are
>>> independent of each other and which feed higher regions at a slower rate.
>>> For example, audio (speech) prediction - break audio into spectrum
>>> (frequency bands), feed each band in the spectrum to individual regions.
>>> Feed predictions of each region into a higher region that aggregates all
>>> the predictions of individual audio bands.
>>>
>>> If we want to move this forward we should come up with a roadmap and
>>> next steps. Input from others here would be appreciated.
>>>
>>> -Doug
>>>
>>>
>>> On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:
>>>
>>>> So if I understand correctly most of the code is Python but the
>>>> "core" is mostly C++ or is going to be C++.
>>>>
>>>> The way I would approach this problem is the following (in temporal
>>>> order):
>>>>
>>>> 1) Make sure all the SP and TP (basically the code with a lot of nested
>>>> loops) is at least C++. I would also make sure I am not abusing the STL
>>>> or Boost libraries too much (for instance in the list of active
>>>> columns, etc.), as this is usually tricky (or low performance) when ported
>>>> to GPUs.
>>>>
>>>> 2) I would convert the above nested loops (for all columns, for all
>>>> cells, etc) to OpenCL. We have to be careful about data movement today and
>>>> try to keep most of the data in GPU space (although this problem is going
>>>> to disappear with next or next-next generation of GPUs which are going to
>>>> have unified address space).
>>>>
>>>> 3) Support for multiple GPUs within a single node. I would map
>>>> different regions on different host-threads and GPUs and for large regions
>>>> I would try to partition them across multiple host-threads and GPUs.
>>>> Considering that they are 2D regions and communication is mostly localized
>>>> around the columns it should be doable. However, the boundaries of the
>>>> partitions are going to be tricky as updates to cells or columns will depend
>>>> on values that are in another GPU address space (again next-next generation
>>>> of accelerators should solve this problem with unified address space).
>>>>
>>>> 4) To parallelize across multiple nodes in a cluster I would
>>>> definitely go for MPI and not map-reduce. The reason is that map-reduce
>>>> is used mostly on embarrassingly parallel jobs with a final reduce phase to
>>>> compute the final result. In our case, considering that the "computation" is
>>>> based around the concept of a step (or clock), at every step there is going
>>>> to be a significant amount of communication across regions and within the
>>>> region. Thankfully if we stick to the concept of time step (not exactly
>>>> brain like) we can batch that communication and perform it at the end of
>>>> each step. If using MPI, I would map different regions on different MPI
>>>> ranks and for large regions I would partition them on multiple MPI ranks as
>>>> explained for point 3 within the node (basically a hierarchy of
>>>> parallelization).
>>>>
>>>> The code would obviously also work with a single MPI process and
>>>> therefore on a normal workstation with or without GPUs. BTW by GPU I also
>>>> mean Intel Phi. Also I estimate that points 1 to 4 would take at least 1 or
>>>> 2 years of work depending on the number of people involved.
>>>>
>>>> Regarding hardware implementation, I also feel it is the right way to go
>>>> in the long term but for now I would definitely go with the above
>>>> solution (considering most likely the algorithm will change in the next
>>>> years).
>>>> If well implemented the above approach could increase performance by at
>>>> least 2 orders of magnitude within a node and most likely scale linearly
>>>> across a moderate number of cluster nodes.
>>>>
>>>> I know some of the people on this mailing list have implemented their
>>>> own C++ version of HTM in the past, so I am sure they would definitely be
>>>> interested. Comments are welcome.
>>>>
>>>> Oreste
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>>>>
>>>>> Hi Oreste,
>>>>>
>>>>> you are right, performance will be a central issue. There are a few
>>>>> bottlenecks in the algorithm that can be attacked with hardware
>>>>> acceleration. The best approach I can think of for now is to use
>>>>> parallelization (some form of map-reduce) to solve this. OpenCL would be a
>>>>> good choice to use in place of some of the C++ or Python code. The rest of
>>>>> the Python code could be kept as-is to allow for easy experimentation for
>>>>> optimization of parameters or changes to features that are not core CLA
>>>>> algorithms.
>>>>>
>>>>> There are many OpenCL drivers for GPUs and there is even a platform
>>>>> for converting OpenCL code to FPGA hardware. Eventually the CLA will be
>>>>> ported to some sort of digital/analog hybrid device that simulates
>>>>> dendrite/synapse connection on neuromorphic silicon. This will not be far
>>>>> off - maybe 5 years or less for early experiments, 10 years for cheap
>>>>> commodity devices.
>>>>>
>>>>> For now, most of us are trying to get results that are proof of
>>>>> concept with the current code base, then we will figure out how to scale
>>>>> up and optimize.
>>>>>
>>>>> Another key to acceleration will be the sharing of trained networks
>>>>> that have encapsulated many CPU hours of training on fundamental streams
>>>>> of data, for example speech audio, that once trained will be shared or
>>>>> sold. If this happens the building blocks of lower HTM regions could be
>>>>> leveraged to get to the next level. We need to work towards some CLA
>>>>> network serialization standards for this to happen.
>>>>>
>>>>> I think you are correct in your assumptions, and if you want to
>>>>> contribute to the effort to move to a more performant version of the code
>>>>> I would love to see someone port some of the critical segments of the CLA
>>>>> code to OpenCL. For an analysis of where the bottlenecks are in the CLA
>>>>> and hardware solutions you can start by checking out this paper:
>>>>> http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>>>>
>>>>> -Doug
>>>>>
>>>>>
>>>>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>>>>
>>>>>> Hello everybody, this is my first post on this list so please forgive
>>>>>> me if this has already been addressed before.
>>>>>>
>>>>>> I have seen that the current NuPIC source code is mostly Python and
>>>>>> I am wondering....
>>>>>>
>>>>>> I don't know about the problems people are trying to solve today
>>>>>> (maybe for demand and response of power in a building this is not true)
>>>>>> but in the future I believe performance is going to be a central issue.
>>>>>> Python seems to be a non-optimal choice in this respect (as are
>>>>>> single-threaded Java, single-threaded C#, single-threaded C++, or
>>>>>> anything not parallel).
>>>>>>
>>>>>> I keep thinking for instance that the Large Hadron Collider at CERN
>>>>>> produces something like 3 GByte <http://en.wikipedia.org/wiki/Megabyte>/s
>>>>>> of raw data and it would be really nice if we were able to feed a full
>>>>>> year of experiments in real time to a system based on the CLA. Also in
>>>>>> robotics, performance and I/O bandwidth requirements for vision, sensing
>>>>>> and motion control are impressive.
>>>>>>
>>>>>> The question/discussion point I wanted to make is, where does the
>>>>>> project stand in terms of performance? More specifically, are there any
>>>>>> plans to design high performance code inside NuPIC (openMP, CUDA,
>>>>>> MPI)? Is this something much less emphasized because the focus of the
>>>>>> project is more on learning the basic CLA principles?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Oreste
>>>>>>
>>>>>> _______________________________________________
>>>>>> nupic mailing list
>>>>>> [email protected]
>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org


-- 
Marek Otahal :o)
