Answers below...

On Mon, Feb 29, 2016 at 3:47 PM, Wakan Tanka <[email protected]> wrote:

> 1. Why are both columns ("timestamp" and "kw_energy_consumption") included
> in the 2nd swarm config, while there is only one column ("sine") under
> "includedFields" in the 1st example? If I understand correctly, in the 1st
> example the swarm will only operate on the "sine" column (not "angle"), and
> in the 2nd example the swarm will operate on both columns ("timestamp" and
> "kw_energy_consumption"). Is this correct? Is it worth incorporating
> "angle" in the 1st example, or conversely removing "timestamp" from the 2nd
> example? What would happen? I guess that in the 2nd example only
> "kw_energy_consumption" is needed because this is what we want to predict,
> and in the 1st config we want to predict "sine", so "angle" would be
> meaningless. Do more columns automatically mean a better model, or what is
> going on?
The timestamp column is only included if time is relevant to the data. For the sine wave, time is not relevant: the sine pattern is consistent whether it starts at 1PM on a Tuesday or 8:34AM on a Saturday. It simply doesn't matter, so it is left out entirely. For the gym data, time is quite important because there are daily and weekly patterns. The HTM must know and encode the time of each data point to understand where in the daily/weekly cycle each point occurs.

Adding the angle to the sine demo would not improve predictions because the sine is a function of the angle; as the angle changes, the sine value changes in the same fashion. Removing the timestamp from the hotgym example would hurt predictions because NuPIC would not be able to identify that the gym usually starts hopping around 6AM (or whatever). If the data are coming in at regular intervals, it may not matter, because the model will learn that every 1440th data point (or whatever) there is an uptick in power usage. But if the data are irregular (like most data), it will completely lose the concept of time unless time is specifically encoded.

> 2. What is the relationship between includedFields and
> ['streamDef']['streams'][0]['columns']? Isn't this redundant? What else
> except '*' can be contained under ['streamDef']['streams'][0]['columns'],
> and when should I change this?

The streamDef is a way to interface actual data into NuPIC. It is a bit redundant, but imagine you had a data file with 100 columns. Instead of "*", you could list only the columns you wanted and the rest would be ignored.

> 3. What (SDR) encoder is used as a default? I guess it should be possible
> to change it, because as is mentioned in [1]: "There are a number of
> factors that swarming considers when creating potential models to evaluate
> ... which model components should be used (encoders, spatial & temporal
> poolers, classifier, etc.), and what parameter values should be chosen for
> each component."
> And also in [2]: "Swarming figures out which optional components should go
> into a model (encoders, spatial pooler, temporal pooler, classifier,
> etc.)." The only way I've found to change the encoder is trying to decipher
> the JSON schema [3] and the list of available encoders [4].

I don't know the whole answer to this question, but I don't think swarming tries the RandomDistributedScalarEncoder. For scalar values it just uses the ScalarEncoder, then permutes over lots of different encoder settings to try to find the best fit for the data. It does the same thing with the CategoryEncoder if the input is a string. Once you get model parameters back from a swarm, you can change the encoder parameters there and try different settings. All the encoder settings are in the "encoders" section of the model params (example [1]).

> 4. In the JSON schema description [3] and in [2] there is shown the use of
> custom metrics. I guess those metrics affect the best-model election during
> the swarm, or am I wrong? Are there any code examples which use the further
> fields mentioned in the JSON schema [3]?

I think the [3] you linked to is something internal to swarming: a config file that the swarming process uses while it is permuting over model and encoder settings. "Custom Metrics" is a term from the Grok product, not NuPIC. It is a way to define custom input streams for Grok for IT [2] (which uses NuPIC).

> 5. Is it possible to have different columns under includedFields and
> predictedField? In other words: does it make any sense to have the model
> operate (predict or detect anomalies) on columns other than the ones the
> swarm was run on? I guess not, but one never knows.

You can change the model params that a swarm returns to do something different if you want (like predict a different field), but you probably won't have much success, because those params were uncovered specifically for the predictedField you told it to swarm for.
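To make that concrete, here is a rough sketch of that kind of edit: retargeting swarm-returned params at a different scalar field. The layout loosely follows the "encoders" section of the hotgym model_params linked as [1], but the field names, encoder values, and the helper function are placeholders of my own, not anything the swarm produces:

```python
import copy

# Illustrative subset of swarm-returned model params, loosely modeled on
# the "encoders" section of the hotgym model_params linked in [1].
SWARM_PARAMS = {
    "modelParams": {
        "sensorParams": {
            "encoders": {
                "timestamp_timeOfDay": {
                    "type": "DateEncoder",
                    "fieldname": "timestamp",
                    "timeOfDay": (21, 1),
                },
                "kw_energy_consumption": {
                    "type": "ScalarEncoder",
                    "fieldname": "kw_energy_consumption",
                    "n": 28,
                    "w": 21,
                    "minval": 0.0,
                    "maxval": 100.0,  # placeholder range
                },
            }
        }
    }
}


def retarget_params(params, old_field, new_field):
    """Return a copy of the params with one scalar field renamed.

    This only rewrites the encoder bookkeeping; the encoder *settings*
    were tuned for the old field, so predictions for the new field may
    be poor until you re-swarm or hand-tune them.
    """
    new_params = copy.deepcopy(params)
    encoders = new_params["modelParams"]["sensorParams"]["encoders"]
    encoder = encoders.pop(old_field)
    encoder["fieldname"] = new_field
    encoders[new_field] = encoder
    return new_params


new_params = retarget_params(
    SWARM_PARAMS, "kw_energy_consumption", "some_other_float_field")
# When actually running a model you would also point inference at the new
# field, e.g. model.enableInference({"predictedField": "some_other_float_field"})
```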
But sure, if you change the predictedField and re-run the model, it should still work as long as the new predicted field has the same data type. Just don't expect it to perform very well.

> 6. Can somebody please explain to me the following statement from [2]:
> "Swarming also figures out which fields of the input are useful in making
> good predictions. If a field is not useful, it is not included in the final
> model." I'm the one who specifies what to include in the swarm (under
> includedFields), not some algorithm, or am I wrong?

You might specify 4 included fields but only one predicted field. This means you want NuPIC to find the best model params to predict one of those fields, and here are 3 other streams of input data it may use to make the prediction if they help. The swarm will (in many cases) decide that none of those other data streams are useful for predicting the field you wanted. Or it might decide that one or two are useful. In the hotgym example, the timestamp is certainly a useful field to encode and use when analyzing the power consumption, because time of day and day of week are very important data to have when making a power consumption prediction. I hope that makes sense.

> 7. Can I understand permutations.py [2] as a lower-level control of the
> swarm? Are there any examples?

Yes, sort of, I guess... Honestly, I don't use permutations.py. It was confusing to me when I started using swarming, so I created the "programmatic approach" to swarming [3]. It still uses the same low-level machinery under the hood, but you don't have to know about it.

[1] https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L57-L88
[2] http://grokstream.com/
[3] https://github.com/numenta/nupic/wiki/Running-Swarms#running-a-swarm-programmatically

I hope that was helpful.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta
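P.S. Since a couple of the questions were about the shape of the swarm description, here is a stripped-down sketch of one, loosely following the hotgym-style description from the Running Swarms wiki page [3]. The file path, value ranges, and "info" strings are placeholders; the point is how includedFields relates to the streamDef's "columns":

```python
# A stripped-down, hotgym-style swarm description. "includedFields" tells
# the swarm which fields it may consider and how to treat them; the
# streamDef's "columns" just selects which columns to read from the data
# source ("*" = all of them, with includedFields doing the filtering).
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {"fieldName": "kw_energy_consumption",
         "fieldType": "float",
         "minValue": 0.0,      # placeholder range
         "maxValue": 100.0},
    ],
    "streamDef": {
        "info": "kw_energy_consumption",
        "version": 1,
        "streams": [{
            "info": "Rec Center",
            "source": "file://rec-center-hourly.csv",  # placeholder path
            "columns": ["*"],
        }],
    },
    "inferenceType": "TemporalMultiStep",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": "kw_energy_consumption",
    },
    "swarmSize": "medium",
}

# Every field the swarm is allowed to consider must appear in
# includedFields, and the predictedField must be one of them:
included = {f["fieldName"] for f in SWARM_DESCRIPTION["includedFields"]}
```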
