On Thu, May 21, 2015 at 5:25 AM, Wakan Tanka <[email protected]> wrote:
> Pushing data to the NuPIC algorithm requires a model. The easiest way of
> creating this model is to use the swarming process. Swarming basically
> tries various model combinations over your data, and if a particular
> model is not good, it is dropped. So after swarming you should have the
> best model for your data. This whole process is achieved with the
> following command:
> $NUPIC/scripts/run_swarm.py $PWD/search_def.json --maxWorkers=6

This is only one way of doing it. You can also run swarms from a
python script and get the model parameters programmatically [1]. This
would allow you to run swarms and use the resulting model parameters
to build models and start the process all in one python script.
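
For example, the programmatic route looks roughly like this. This is a minimal sketch, not code from this thread: the swarm description fields, the sine.csv source, and the worker count are assumptions modeled on the wiki's swarming tutorial.

```python
# Sketch of running a swarm programmatically instead of via run_swarm.py.
# The swarm description below is a hypothetical minimal example.
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "sine", "fieldType": "float",
         "minValue": -1.0, "maxValue": 1.0},
    ],
    "streamDef": {
        "info": "sine",
        "version": 1,
        "streams": [
            {"info": "sine data", "source": "file://sine.csv",
             "columns": ["*"]},
        ],
    },
    "inferenceType": "TemporalMultiStep",
    "inferenceArgs": {"predictionSteps": [1], "predictedField": "sine"},
    "iterationCount": -1,       # -1 means "use every row in the stream"
    "swarmSize": "medium",
}


def swarm_over_data():
    """Run the swarm and return the best model parameters as a dict."""
    # Imported lazily so the module can be read without NuPIC installed.
    from nupic.swarming import permutations_runner
    return permutations_runner.runWithConfig(
        SWARM_DESCRIPTION, {"maxWorkers": 6, "overwrite": True})
```

The returned dict can be passed straight to ModelFactory.create(), which is what "all in one python script" means above.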

>
> Q:
> I do not know what I should imagine a model to be; maybe it is some
> parameters, etc., so that is another question.
> If swarming is the easiest process, what other methods can be used to
> create a model?

I wouldn't say swarming is the easiest process. You can also guess
model parameters, but it can take a very long time to adjust them
manually to find the best settings. That is what swarming helps with.
At least it gives you a starting point to manually tweak settings.

One of the most valuable things that swarming tells you is which input
fields are factors in the prediction of the target field. The
resulting model parameters contain encoder settings for your data,
which means the swarm has decided those fields are relevant to the
prediction. Other fields with encoders of "None" are deemed
irrelevant.
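
A quick way to see which fields the swarm kept is to walk the encoders section of the resulting model parameters. This is a sketch over a hypothetical params dict; the field names are made up, but the nested key layout follows the model_params files swarming writes out.

```python
def encoded_fields(model_params):
    """Return the field names the swarm decided to encode.

    Fields whose encoder entry is None were deemed irrelevant
    to predicting the target field.
    """
    encoders = model_params["modelParams"]["sensorParams"]["encoders"]
    return sorted(name for name, enc in encoders.items() if enc is not None)


# Hypothetical swarm output: only "sine" and "timestamp" got encoders.
params = {"modelParams": {"sensorParams": {"encoders": {
    "sine": {"fieldname": "sine", "type": "ScalarEncoder"},
    "timestamp": {"fieldname": "timestamp", "type": "DateEncoder"},
    "noise": None,   # the swarm dropped this field
}}}}

print(encoded_fields(params))  # → ['sine', 'timestamp']
```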

Swarming is not perfect, especially with many fields of input data.
You can only swarm for so long, so sometimes the process doesn't have
a chance to evaluate all the permutations over the input field encoder
settings. You have to use some common sense when looking at the swarm
output. If you KNOW that a certain input field should be a factor in
the prediction and the swarm doesn't encode it, maybe you need to run
a larger swarm or cut back on the other input fields.

> What criteria should a model meet to be classified as good or bad during
> swarming?

Internally, each model keeps an error score that tracks how well it is
doing at making predictions. Each prediction is compared to the next
available input row of real data to generate this score. This is also
how the anomaly score is calculated. The model can easily tell how
well it is doing by comparing what it predicted in the past against
the real data once it arrives.

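The idea can be sketched in plain Python. This is not NuPIC's actual metric (internally it uses measures such as altMAPE), just the shape of the bookkeeping: hold on to the previous prediction and score it against the row that arrives next.

```python
def rolling_error(stream, predict):
    """Score a one-step predictor against a stream, one row at a time.

    `predict` maps the current value to a guess for the next one.
    Returns the mean absolute error of all scored predictions.
    """
    total, count = 0.0, 0
    previous_prediction = None
    for value in stream:
        if previous_prediction is not None:
            # Compare last step's prediction to the real data we just got.
            total += abs(previous_prediction - value)
            count += 1
        previous_prediction = predict(value)
    return total / count if count else 0.0


# A naive "predict no change" model on a ramp is off by 1 at every step.
print(rolling_error([1, 2, 3, 4, 5], lambda v: v))  # → 1.0
```
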
> Now, when you have the best model for your data, you need to pass this
> model + your input data to NuPIC to get something useful (I know about
> prediction, which is basically configured with "predictionSteps"; what
> other useful info can be gained from your data is another question).
> This process is achieved with the following command:
> $NUPIC/scripts/run_opf_experiment.py $PWD/model_0/

This is not the only way of running NuPIC experiments. See some of my
tutorials for other examples [2].

>
> After this you will get an inference file where the prediction will be stored.
>
> The reason for two separate processes for creating the best model and
> running it that comes to my mind right now is the fact that running a
> model on data can also be CPU intensive (I do not know if this is true,
> just guessing; can someone confirm?).

Swarming is MUCH more CPU-intensive than running one NuPIC model. That's
because it is running many models at once.

> So you don't have to run two CPU-intensive processes, just one (running
> the model), under the condition that you did not touch your original
> data.
> I would like to know what would happen if I create a model with swarming
> for some data and then change that data and run this model? Is this
> complete nonsense, or is this used somewhere?

This happens all the time. NuPIC is an example of "online learning".
As long as the "shape" and type of the data don't change, NuPIC
will adjust its learning as the data changes over time. It learns the
changing patterns of the data. This is a huge reason why it is a
powerful technology.

> I would also ask if I can tell NuPIC to stop learning after a certain
> amount of data? AFAIK NuPIC is always learning, so it might happen that
> it will also learn anomalies and consider them normal?
> E.g. imagine the sine prediction example. First NuPIC does not know
> about the data and simply repeats what it sees, and its anomaly score is
> high. Then, when a certain number of periods have passed, NuPIC will
> learn this pattern and lower the anomaly score. Anything that is close
> to this pattern will have a low anomaly score. In other words, it will
> be able to predict what will happen at a given time and answer questions
> like: will the function be increasing, decreasing, or unchanged in the
> next step? The problem is that it will be constantly learning. How can I
> make it stop learning, e.g. after a certain amount of data, and just
> give an anomaly score?

Yes, learning can be turned on and off. In the OPF, you do this using
the "disableLearning()" and "enableLearning()" methods on the model
object [3].
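
As a sketch, a feed loop that freezes learning after a fixed number of rows might look like this. disableLearning() is the real OPF Model method referenced above; the cutoff value and the anomaly-score key are hypothetical choices for illustration.

```python
def feed_with_learning_cutoff(model, rows, learn_rows=500):
    """Run rows through an OPF model, freezing learning after learn_rows.

    After the cutoff the model stops adapting, so the patterns it has
    already learned stay "normal" and anomalies keep scoring high.
    """
    scores = []
    for i, row in enumerate(rows):
        if i == learn_rows:
            model.disableLearning()   # OPF Model method [3]
        result = model.run(row)
        scores.append(result.inferences.get("anomalyScore"))
    return scores
```

If the data's patterns later shift for real and you want the model to adapt again, call model.enableLearning() and resume feeding rows.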

>
>>
>>
>>     1. Is there any benefit to running swarming and running the model
>>     in two steps?
>>
>> yes, swarm=parameters, run=outputs; although swarming IMHO also saves
>> the output of the best model, so you can already use that.
>>
> Yes, when you run a swarm you get the best model; the only benefit of
> reusing an "old" swarm is avoiding the CPU load described above. What
> benefit do the swarm parameters (the model) give me when I do not use
> them on data?

I don't understand this question. There is no reason to swarm if you
do not plan on using the resulting model parameters to build a model
and pass it data with the same shape and type used in the swarm.

>>     2. Is running only swarming, without running the model, useful for
>> something?
>>
>> yes, parameters.
>
> Same question as above: what are the parameters themselves good for?

I don't think I agree with Marek. Model parameters are only useful for
building models, as far as I know. I don't see a point to swarming if
you are not going to build a model.

>
>>
>>     3. Is there any benefit to three separate steps for running the
>>     model in Python?
>>
>> which steps?
>>
>
> Maybe "steps" is not the right word, but rather function calls. On the
> command line there are two: one to create the model and another to run it:
>
> $NUPIC/scripts/run_swarm.py $PWD/search_def.json --maxWorkers=6
> $NUPIC/scripts/run_opf_experiment.py $PWD/model_0/
>
> in Python there is one for creating the model (I guess; if I am wrong,
> please correct me):
> model_params = swarm_over_data()
>
> and these for running the model (again, correct me if I am wrong):
> model = ModelFactory.create(model_params)
> model.enableInference({"predictedField": "sine"})
> result = model.run({"sine": sine_value})

This is just how I organized my program when I created the example.
You may choose to organize it differently; it is up to you.

>>     4. When I have swarm data created in the past and did not touch
>>     the input data, how can I reuse it and run the model in Python?
>>
>> call the model with the params you got from swarming; if you really did
>> not change the data, you could have saved the model (serialized it) and
>> then just restored it again.
>>
> Can you post Python code showing how to do that? If I understand
> correctly, I need to somehow convince the following function,
> ModelFactory.create(), to take output from the already-existing files
> (the model_0 dir) rather than pass it the result of swarm_over_data() as
> in the previous example, because that would swarm again.

The model instance has a save() function. Call it like this:

model.save("/path/to/empty/directory")

Then later you can resurrect the model:

ModelFactory.loadFromCheckpoint("/path/to/saved/model")

>> Note on swarming: of course it helps, but in NuPIC the general defaults
>> are usually just "good enough", and the model adjusts to the data
>> itself, so usually just running the model is enough.
>>
> But you must have some model (created by swarming) to be able to push
> your data to NuPIC. You cannot skip the swarming process (unless you
> create the model manually), or can you?

If you are using your own data that is different from some example we
have, and you are not building an anomaly model, you probably need to
swarm.

[1] https://github.com/numenta/nupic/wiki/Running-Swarms#running-a-swarm-programmatically
[2] https://github.com/numenta/nupic/wiki/Using-NuPIC#tutorials
[3] http://numenta.org/docs/nupic/classnupic_1_1frameworks_1_1opf_1_1model_1_1_model.html

---------
Matt Taylor
OS Community Flag-Bearer
Numenta
