Answers below...

On Mon, Feb 29, 2016 at 3:47 PM, Wakan Tanka <[email protected]> wrote:

> 1. Why are both columns ("timestamp" and "kw_energy_consumption") included
> in the 2nd swarm config, while there is only one column ("sine") under
> "includedFields" in the 1st example? If I understand correctly, in the 1st
> example the swarm will only operate on the "sine" column (not "angle"), and
> in the 2nd example the swarm will operate on both columns ("timestamp" and
> "kw_energy_consumption"). Is this correct? Is it worth incorporating
> "angle" in the 1st example, or conversely removing "timestamp" from the 2nd
> example? What would happen? I guess that in the 2nd example only
> "kw_energy_consumption" is needed because this is what we want to predict,
> and in the 1st config we want to predict "sine", so "angle" would be
> meaningless. Do more columns automatically mean a better model, or what is
> going on?
The timestamp column is only included if time is relevant to the data. For the sine wave, time is not relevant: the sine pattern is consistent whether it starts at 1PM on a Tuesday or 8:34AM on a Saturday. It simply doesn't matter, so it is left out entirely. For the gym data, time is quite important because there are daily and weekly patterns. The HTM must know and encode the time of each data point to understand where in the daily/weekly cycle each point occurs.

Adding the angle to the sine demo would not improve predictions because the sine is a function of the angle; as the angle changes, the sine value changes in the same fashion. Removing the timestamp from the hotgym example would hurt predictions because NuPIC would not be able to identify that the gym usually starts hopping around 6AM (or whatever). If the data are coming in at regular intervals, it may not matter, because the model will learn that every 1440th data point (or whatever) there is an uptick in power usage. But if the data are irregular (like most data), it will completely lose the concept of time unless time is specifically encoded.

> 2. What is the relationship between includedFields and
> ['streamDef']['streams'][0]['columns']? Isn't this redundant? What else
> except '*' can be contained under ['streamDef']['streams'][0]['columns'],
> and when should I change this?

The streamDef is a way to interface actual data into NuPIC. It is a bit redundant, but imagine you had a data file with 100 columns. Instead of "*", you could list only the columns you wanted and the rest would be ignored.

> 3. What (SDR) encoder is used as a default? I guess it should be possible
> to change it, because as is mentioned in [1]: "There are a number of
> factors that swarming considers when creating potential models to evaluate
> ... which model components should be used (encoders, spatial & temporal
> poolers, classifier, etc.), and what parameter values should be chosen for
> each component."
> And also in [2]: "Swarming figures out which optional components should go
> into a model (encoders, spatial pooler, temporal pooler, classifier,
> etc.)." The only way I've found to change the encoder is trying to decipher
> the JSON schema [3] and the list of available encoders [4].

I don't know the whole answer to this question, but I don't think swarming tries the RandomDistributedScalarEncoder. For scalar values it just uses the ScalarEncoder, then permutes over lots of different encoder settings to try to find the best fit for the data. It does the same thing with the CategoryEncoder if the input is a string. Once you get model parameters back from a swarm, you can change the encoder parameters there and try different settings. All the encoder settings are in the "encoders" section of the model params (example [1]).

> 4. In the JSON schema description [3] and in [2] there is shown the use of
> custom metrics. I guess those metrics affect the best-model election during
> the swarm, or am I wrong? Are there any code examples which use the further
> fields mentioned in the JSON schema [3]?

I think the [3] you linked to is something internal to swarming: a config file that the swarming process uses while it is permuting over model and encoder settings. "Custom Metrics" is a term from the Grok product, not NuPIC. It is a way to define custom input streams for Grok for IT [2] (which uses NuPIC).

> 5. Is it possible to have different columns under includedFields and
> predictedField? In other words: does it make any sense to have the model
> operate (predict or detect anomalies) on columns other than the ones the
> swarm was run on? I guess not, but one never knows.

You can change the model params that a swarm returns to do something different if you want (like predict a different field), but you probably won't have much success, because those params were uncovered specifically for the predictedField you told it to swarm for.
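To make that concrete, here is a rough sketch of that kind of edit: retargeting swarm-returned params at a different scalar field. The layout loosely follows the "encoders" section of the hotgym model_params linked as [1], but the field names, encoder values, and the helper function are placeholders of my own, not anything the swarm produces:

```python
import copy

# Illustrative subset of swarm-returned model params, loosely modeled on
# the "encoders" section of the hotgym model_params linked in [1].
SWARM_PARAMS = {
    "modelParams": {
        "sensorParams": {
            "encoders": {
                "timestamp_timeOfDay": {
                    "type": "DateEncoder",
                    "fieldname": "timestamp",
                    "timeOfDay": (21, 1),
                },
                "kw_energy_consumption": {
                    "type": "ScalarEncoder",
                    "fieldname": "kw_energy_consumption",
                    "n": 28,
                    "w": 21,
                    "minval": 0.0,
                    "maxval": 100.0,  # placeholder range
                },
            }
        }
    }
}


def retarget_params(params, old_field, new_field):
    """Return a copy of the params with one scalar field renamed.

    This only rewrites the encoder bookkeeping; the encoder *settings*
    were tuned for the old field, so predictions for the new field may
    be poor until you re-swarm or hand-tune them.
    """
    new_params = copy.deepcopy(params)
    encoders = new_params["modelParams"]["sensorParams"]["encoders"]
    encoder = encoders.pop(old_field)
    encoder["fieldname"] = new_field
    encoders[new_field] = encoder
    return new_params


new_params = retarget_params(
    SWARM_PARAMS, "kw_energy_consumption", "some_other_float_field")
# When actually running a model you would also point inference at the new
# field, e.g. model.enableInference({"predictedField": "some_other_float_field"})
```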
But sure, if you change the predictedField and re-run the model, it should still work as long as the new predicted field has the same data type. Just don't expect it to perform very well.

> 6. Can somebody please explain to me the following statement from [2]:
> "Swarming also figures out which fields of the input are useful in making
> good predictions. If a field is not useful, it is not included in the final
> model." I'm the one who specifies what to include in the swarm (under
> includedFields), not some algorithm, or am I wrong?

You might specify 4 included fields but only one predicted field. This means you want NuPIC to find the best model params to predict one of those fields, and here are 3 other streams of input data it may use to make the prediction if they help. The swarm will (in many cases) decide that none of those other data streams are useful for predicting the field you wanted. Or it might decide that one or two are useful. In the hotgym example, the timestamp is certainly a useful field to encode and use when analyzing the power consumption, because time of day and day of week are very important data to have when making a power consumption prediction. I hope that makes sense.

> 7. Can I understand permutations.py [2] as a lower-level control of the
> swarm? Are there any examples?

Yes, sort of, I guess... Honestly, I don't use permutations.py. It was confusing to me when I started using swarming, so I created the "programmatic approach" to swarming [3]. It still uses the same low-level machinery under the hood, but you don't have to know about it.

[1] https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L57-L88
[2] http://grokstream.com/
[3] https://github.com/numenta/nupic/wiki/Running-Swarms#running-a-swarm-programmatically

I hope that was helpful.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta
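P.S. Since a couple of the questions were about the shape of the swarm description, here is a stripped-down sketch of one, loosely following the hotgym-style description from the Running Swarms wiki page [3]. The file path, value ranges, and "info" strings are placeholders; the point is how includedFields relates to the streamDef's "columns":

```python
# A stripped-down, hotgym-style swarm description. "includedFields" tells
# the swarm which fields it may consider and how to treat them; the
# streamDef's "columns" just selects which columns to read from the data
# source ("*" = all of them, with includedFields doing the filtering).
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {"fieldName": "kw_energy_consumption",
         "fieldType": "float",
         "minValue": 0.0,      # placeholder range
         "maxValue": 100.0},
    ],
    "streamDef": {
        "info": "kw_energy_consumption",
        "version": 1,
        "streams": [{
            "info": "Rec Center",
            "source": "file://rec-center-hourly.csv",  # placeholder path
            "columns": ["*"],
        }],
    },
    "inferenceType": "TemporalMultiStep",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": "kw_energy_consumption",
    },
    "swarmSize": "medium",
}

# Every field the swarm is allowed to consider must appear in
# includedFields, and the predictedField must be one of them:
included = {f["fieldName"] for f in SWARM_DESCRIPTION["includedFields"]}
```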
