Also, the suggestions to compress the existing serialized state should help a lot!
On Fri, Dec 18, 2015 at 10:32 AM, Scott Purdy <[email protected]> wrote:

> Great questions.
>
> - We consider HTM models to be "fixed resources" that use a constant
> amount of memory. However, we may not allocate all of this memory right
> up front. Since many problems will never use all of the possible
> segments, we allocate segments as they are created.
>
> - If you are using the new Python implementation of Temporal Memory, the
> number of segments is not limited. This is something we want to allow,
> but it should definitely not be the default. See
> https://github.com/numenta/nupic/issues/1588
>
> - Yes, the Cap'n Proto serialization will be MUCH smaller. I doubt there
> will be much decrease in size possible with any technique once we switch
> over.
>
> On Fri, Dec 18, 2015 at 8:03 AM, Marek Otahal <[email protected]> wrote:
>
>> I did a quick search for answers about Cap'n Proto:
>> https://capnproto.org/encoding.html
>>
>> PACKING - a compression technique Cap'n Proto itself offers (it just
>> discards excess zeros - think sparse vs. dense vector). Scott: are we
>> enabling this? IMHO we should from the start, to avoid
>> backwards-compatibility issues later.
>>
>> COMPRESSION - for repetitive data, which HTM networks typically are,
>> they suggest compression with an external tool. Matt, this is something
>> that can/has to be implemented regardless of capnp/pickle. The OPF code
>> responsible for writing the file should afterwards call a compression
>> program; from my experience it helps enormously on the current (pickle)
>> data.
>>
>> On Fri, Dec 18, 2015 at 4:22 PM, Matthew Taylor <[email protected]> wrote:
>>
>>> BTW, we have been working on a new serialization format. The old one
>>> uses Python's pickle functionality, and there are several problems
>>> with it. The new method in NuPIC will use Cap'n Proto serialization,
>>> which is a very fast and efficient technique that happens on the C++
>>> side (and through the pycapnp adapter in Python).
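[Editor's note: Marek's external-compression point above is easy to try outside NuPIC. A minimal sketch using only the Python standard library - the model state here is a made-up stand-in for serialized HTM data, not actual NuPIC output:]

```python
import gzip
import pickle

# Hypothetical stand-in for serialized HTM state: mostly-zero, highly
# repetitive data, which is roughly what permanence matrices look like.
fake_state = {
    "permanences": [0.0] * 200000,
    "active_cells": list(range(0, 200000, 500)),
}

raw = pickle.dumps(fake_state)   # what a pickle-based save produces
packed = gzip.compress(raw)      # what a post-write compression step adds

# On repetitive data like this, the gzip copy is dramatically smaller.
print(len(raw), len(packed))
```

[This is the same effect Marek reports on real checkpoints ("from hundreds of MB to 10s"); any external compressor called after the OPF write would do.]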
>>> Once we have this finished, the time it takes to save and retrieve
>>> models should decrease by about tenfold (based on Scott's initial
>>> experiments). I assume this will come along with a considerable
>>> decrease in serialization size on disk, but I have not checked. If
>>> Scott is reading this, maybe he can answer.
>>>
>>> ---------
>>> Matt Taylor
>>> OS Community Flag-Bearer
>>> Numenta
>>>
>>> On Fri, Dec 18, 2015 at 2:46 AM, David Ray <[email protected]> wrote:
>>>
>>>> Hi Karin,
>>>>
>>>> "the network can't really grow new connections, which are not yet
>>>> stored in the memory, right? (other than adjusting weights of the
>>>> connections)"
>>>>
>>>> The network does in fact grow new connections: Distal Dendrites are
>>>> formed with Synapses housing new connections to other Cells. This is
>>>> one of the most distinguishing features of HTM Neurons as opposed to
>>>> "point neurons" (i.e., A-to-Z NNs, a.k.a. "Deep" Neural Networks).
>>>>
>>>> See:
>>>> https://github.com/numenta/nupic/blob/master/src/nupic/research/temporal_memory.py#L361
>>>>
>>>> ...starting above from the "pickCellsToLearnOn()" method...
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Dec 18, 2015, at 4:10 AM, Karin Valisova <[email protected]> wrote:
>>>>
>>>> Thank you for your answers!
>>>>
>>>> Matthew, what do you mean by "how much data the model has seen"? I
>>>> have noticed that the size of the network increases with the size of
>>>> the data sample, but I can't really see a reason for that - the
>>>> network can't really grow new connections that are not yet stored in
>>>> memory, right? (other than adjusting the weights of the connections)
>>>> And if it's a matter of the model accumulating data somewhere, for
>>>> calculating sliding-window metrics or things like that, then it can
>>>> theoretically be cut off - if we're talking only about the network's
>>>> ability to process data.
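[Editor's note: David's and Scott's point - segments are allocated as they are created, so serialized size grows with the data seen - can be reduced to a toy sketch. These are not NuPIC's real classes, just an illustration of lazy segment allocation:]

```python
# Toy sketch (not NuPIC's actual data structures): distal segments are
# grown on demand, so the state that must be serialized grows as the
# model sees more data.
class Cell:
    def __init__(self):
        self.segments = []  # distal dendrite segments, created lazily

    def learn(self, presynaptic_cells):
        # Grow a new segment holding synapses to the previously active cells.
        self.segments.append(list(presynaptic_cells))

cell = Cell()
for step in range(3):       # each novel pattern grows new structure
    cell.learn([step, step + 1])

print(len(cell.segments))   # 3 - more data seen, more state to save
```

[This is why a "fixed resource" model can still produce checkpoints whose size depends on how much data it has processed: the memory ceiling is constant, but the allocated fraction is not.]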
>>>> Mark, what kind of compression do you have in mind? Any ideas what
>>>> to try?
>>>>
>>>> Thank you,
>>>> Karin
>>>>
>>>> On Thu, Dec 17, 2015 at 7:29 PM, Marek Otahal <[email protected]> wrote:
>>>>
>>>>> Hi Karin,
>>>>>
>>>>> yes, that is an issue! I've suggested using compression; it helps
>>>>> surprisingly well in this matter (from hundreds of MB to 10s, ...).
>>>>> Afaik it's not implemented yet.
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>> On Thu, Dec 17, 2015 at 6:15 PM, Matthew Taylor <[email protected]> wrote:
>>>>>
>>>>>> That's not too surprising ;). The size of a saved model depends on
>>>>>> several things, including the number of input fields, model
>>>>>> parameters that affect how cells connect, and how much data the
>>>>>> model has seen. There are thousands of connections between cells
>>>>>> that need to be persisted when a model is saved. I have seen
>>>>>> serialized models much larger than 50 MB.
>>>>>> ---------
>>>>>> Matt Taylor
>>>>>> OS Community Flag-Bearer
>>>>>> Numenta
>>>>>>
>>>>>> On Thu, Dec 17, 2015 at 8:06 AM, Karin Valisova <[email protected]> wrote:
>>>>>> > Hello!
>>>>>> >
>>>>>> > I've been playing around with serialization under the OPF
>>>>>> > framework, and I noticed that when using the typical model for
>>>>>> > temporal anomaly detection
>>>>>> >
>>>>>> > https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/anomaly/one_gym/model_params/rec_center_hourly_model_params.py
>>>>>> >
>>>>>> > the size of the saved file gets surprisingly large, ~50 MB. What
>>>>>> > is the reason for this? If I understand correctly, only the
>>>>>> > states of the temporal and spatial poolers should be enough to
>>>>>> > reload a network, right? Or am I forgetting about some extra data
>>>>>> > stored?
>>>>>> >
>>>>>> > Thank you!
>>>>>> > Karin
>>>>>
>>>>> --
>>>>> Marek Otahal :o)
>>>>
>>>> --
>>>>
>>>> datapine GmbH
>>>> Skalitzer Straße 33
>>>> 10999 Berlin
>>>>
>>>> email: [email protected]
>>
>> --
>> Marek Otahal :o)
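[Editor's note: the "sparse vs. dense vector" intuition behind Cap'n Proto's packed encoding, mentioned earlier in the thread, can be illustrated without capnp at all. This is plain Python, not the pycapnp API - packing itself operates on zero bytes in the wire format, but the space-saving idea is the same:]

```python
# Dense binary representation: one slot per cell, almost all zeros.
dense = [0] * 2048
for i in (7, 190, 1033):
    dense[i] = 1

# Sparse representation: store only the active indices. Cap'n Proto's
# "packed" encoding exploits the same property by skipping runs of
# zero bytes in the serialized message.
sparse = [i for i, bit in enumerate(dense) if bit]

print(len(dense), len(sparse))  # 2048 slots vs. 3 indices
```

[HTM activity vectors are deliberately sparse (a few percent active), which is why both packing and external compression pay off so well on serialized models.]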
