Several questions regarding input data and running model

Wakan Tanka Mon, 29 Feb 2016 15:48:12 -0800

Hello NuPIC,

I have a dataset that contains event codes produced by a specialized system. In one minute you can get anywhere from roughly 100k to 250k events, depending on what the system is doing, how busy it is, how it is utilized, etc. Every event has its own integer code, so a sequence looks like this (semicolon separated): 1;2;3;4;5;1;2;2;4;5 ... I know that NuPIC isn't primarily meant for working with integers, so after getting a prediction out of the OPF I round the float to get an integer. As I said, you can get more than 100k events in 60 seconds (roughly 1700 events per second). This means that if I wanted to encode the data with a timestamp, I would need millisecond support in NuPIC. I discussed this a while ago here on the mailing list [1], and the result was to not use DateEncoder (simply omit the timestamp field).




QUESTIONS:
##########

1. I get predictions and anomaly scores from NuPIC using the code below (only the important parts). As you can see, I have two columns (fields) in the CSV input file: "order" and "code". But I'm only using the "code" field in the swarming process and when running the model. In my example "order" is just what its name says: numbers from 1 to n. I've read on the wiki [2] that it is advised to include only a few fields in the model because of performance issues. Does using "order" in my example give any benefit (two cases come to my mind, see my 2nd question)?


search_def:
-----------
  "includedFields": [
    {
      "fieldName": "code",
      "fieldType": "int",
      "maxValue": 255,
      "minValue": 0
    }
  ],

   "predictedField": "code"

csv input:
----------
order,code
int,int
,
1,0
2,1
3,0
4,0
5,1
6,5
7,4
8,0
...


run model:
----------
for row in csvReader:
    col1 = row[0]
    col2 = row[1]
    result = model.run({"code": float(col2)})
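
For completeness, here is a minimal sketch of how the "code" column could be read from a file laid out as above, skipping the three header rows (field names, field types, flags). The helper name iter_codes and the in-memory sample are only illustrative assumptions; the model object is assumed to be the already created OPF model and is not constructed here.

```python
import csv
import io


def iter_codes(csv_file, header_rows=3):
    """Yield the "code" column as floats, in file order.

    Assumes the layout shown above: three header rows (field names,
    field types, flags) followed by "order,code" data rows.
    """
    reader = csv.reader(csv_file)
    for _ in range(header_rows):
        next(reader)  # skip names, types and flags
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        yield float(row[1])  # row[0] is "order" and is not fed to the model


# Small in-memory sample with the same layout as the CSV input above.
sample = io.StringIO(u"order,code\nint,int\n,\n1,0\n2,1\n3,0\n4,0\n5,1\n")
codes = list(iter_codes(sample))
print(codes)

# With a real OPF model one would then call, for each value:
#     result = model.run({"code": value})
```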

2. How can I parallelize model.run? I'm not sure this is even possible, because I guess the data needs to be passed to the model in chronological order (the same order as in the CSV input), and with parallelization this cannot be guaranteed, or am I wrong? One possible benefit of using "order" (my 1st question) that I can see is synchronization during parallelization (1 goes before 2, 2 before 3, etc.). Another candidate for parallelization that comes to my mind is when learning is disabled (see my 4th question) and the model is no longer updating, but I'm not 100% sure about these two cases. I haven't studied Jonathan Mackenzie's code extensively, but I noticed that he uses the multiprocessing module in his code for traffic anomaly detection [3]. Can you please assist?
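
To illustrate what I mean by using "order" for synchronization, here is a rough sketch. It only parallelizes a hypothetical per-row preprocessing step (parse_row and parse_in_parallel are names made up for this example) and then restores the original sequence by sorting on "order". It does not show that model.run itself can be parallelized; the model would still be fed sequentially.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_row(row):
    """Hypothetical per-row preprocessing that is safe to run in parallel."""
    order, code = row
    return int(order), float(code)


def parse_in_parallel(rows, workers=4):
    """Parse rows concurrently, then restore the sequence using "order"."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(parse_row, row) for row in rows]
        # as_completed yields results in completion order, not input order.
        parsed = [future.result() for future in as_completed(futures)]
    parsed.sort(key=lambda pair: pair[0])  # 1 before 2, 2 before 3, ...
    return [code for _, code in parsed]


# Rows deliberately shuffled to mimic out-of-order arrival.
rows = [("3", "0"), ("1", "0"), ("2", "1"), ("5", "1"), ("4", "0")]
ordered_codes = parse_in_parallel(rows)
print(ordered_codes)

# The model would then be fed one value at a time, in this order:
#     for value in ordered_codes:
#         model.run({"code": value})
```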

3. When I'm dealing with integers (rounding floats to ints) as described above, aren't the anomaly score and anomaly likelihood computed incorrectly in some way?
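
The rounding step I am referring to looks roughly like the sketch below. The function name to_event_code is made up for this example, and clamping to the 0-255 range from search_def is an assumption added here, not something NuPIC does for me.

```python
def to_event_code(prediction, min_value=0, max_value=255):
    """Round a float prediction to the nearest integer event code.

    The result is clamped to [min_value, max_value], mirroring the
    minValue/maxValue given in search_def (an assumption of this sketch).
    """
    rounded = int(round(prediction))
    return max(min_value, min(max_value, rounded))


predictions = [3.7, 4.2, -2.0, 300.2]
event_codes = [to_event_code(p) for p in predictions]
print(event_codes)
```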

4. I want to do the following steps: 1. train a model on the 1st dataset, 2. turn off learning, 3. push data from the 2nd dataset to the model, 4. find anomalies in this 2nd dataset using the model trained on the 1st dataset. This is very similar to what Matt did in the hotgym anomaly tutorial (except that I will turn off learning when the 2nd dataset is pushed to the model). Let's say that I have the following two datasets (for simplicity):


1st dataset sequence snippet (short):
1,2,3,4,5


2nd dataset sequence snippet (long):
1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5

As you can see, the 2nd dataset is 4 times longer than the 1st. Is NuPIC able to handle this automatically, or should I somehow create datasets of the same size, and if so, how? Please note that my input length can vary as I described. Please also note that using a sliding window of size 4 and averaging over the 2nd dataset would solve this particular problem, but this was just an example.
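
For this particular example, the window averaging could be sketched as follows. I am assuming non-overlapping windows of 4 values here (block_average is just a name for this sketch); with real data the stretch factor would not be known in advance.

```python
def block_average(values, window=4):
    """Average consecutive, non-overlapping windows of `window` values.

    Trailing values that do not fill a complete window are dropped.
    """
    return [
        sum(values[i:i + window]) / float(window)
        for i in range(0, len(values) - window + 1, window)
    ]


second_dataset = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
compressed = block_average(second_dataset)
print(compressed)  # [1.0, 2.0, 3.0, 4.0, 5.0], the same shape as the 1st dataset
```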

5. Is it necessary to have the 2nd dataset (see my 4th question) shifted so that the patterns start at the same positions as in the 1st dataset? I will try to explain with an example (sorry for my English): suppose you have built a model on the 1st dataset, then turned off learning, then passed in the 2nd dataset, but the data in the 2nd dataset are shifted, e.g. if the 1st dataset started at 00:00, the 2nd dataset started at 05:00. Will NuPIC be able to handle this automatically, or is a manual shift needed? How can this kind of shift be done when the datasets vary in size (see my 4th question)?

6. I am not able to capture the exact time of a particular event. If I were able to capture the event time, would NuPIC be able to handle such a small time interval (has something changed since I discussed it in the past, see my 1st question)?

7. Is it possible to somehow combine models that were saved (pickled)? I will try to explain what I mean: suppose you have saved a model trained on the 1st dataset. Then you have saved a model trained on the 2nd dataset. Is it possible to somehow combine those two models, e.g. take the 1st model and do additional training based on the 2nd model, or vice versa?

[1] Timestamp in micro or mili seconds, 16/02/29, available at: http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2015-June/011138.html
[2] NuPIC Usage FAQ, 16/02/29, available at: https://github.com/numenta/nupic/wiki/NuPIC-Usage-FAQ
[3] Traffic Anomaly Detection in Adelaide using HTM, 16/02/29, available at: https://github.com/JonnoFTW/htm-models-adelaide/blob/master/engine/index.py




Thank you
