Hello NuPIC,
I have a dataset that contains event codes produced by a specialized
system. In one minute you can get anywhere from 100k to 250k events,
depending on what the system is doing, how busy it is, how it is
utilized, etc. Every event has its own integer code, and a sequence
looks like this (semicolon separated): 1;2;3;4;5;1;2;2;4;5 ... I know
that NuPIC isn't primarily meant for working with integers, so after
getting a prediction out of the OPF I round the float to get an integer.
As I said, you can get more than 100k events in 60 seconds (roughly 1700
events per second). This means that if I wanted to encode the data with
a timestamp, I would need millisecond support in NuPIC. I discussed this
a while ago here on the mailing list [1], and the conclusion was not to
use DateEncoder (simply omit the timestamp field).
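To make the input format concrete, here is a minimal sketch of how the
semicolon-separated codes can be turned into the CSV layout shown further
down (three header rows: names, types, and an empty flags row). It is
plain Python 3 and does not touch NuPIC; the function name
events_to_opf_csv is just something I made up for illustration.

```python
import csv
import io


def events_to_opf_csv(raw, out):
    """Write semicolon-separated event codes as a two-column CSV.

    The three header rows (field names, field types, empty flags) mirror
    the layout of my input file. Returns the parsed integer codes.
    """
    codes = [int(tok) for tok in raw.split(";") if tok.strip()]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["order", "code"])   # field names
    writer.writerow(["int", "int"])      # field types
    writer.writerow(["", ""])            # flags row, left empty
    for order, code in enumerate(codes, start=1):
        writer.writerow([order, code])   # "order" simply counts from 1
    return codes


buf = io.StringIO()
codes = events_to_opf_csv("1;2;3;4;5;1;2;2;4;5", buf)
```

The "order" column is generated here only because my file has it; the
model itself never sees it.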
QUESTIONS:
##########
1. I get predictions and anomaly scores from NuPIC using the code below
(only the important parts). As you can see, I have two columns (fields)
in the CSV input file: "order" and "code". But I'm only using the "code"
field in the swarming process and when running the model. In my example
"order" is just what its name says: numbers from 1 to n. I've read on
the wiki [2] that it is advised to include only a few fields in the
model because of performance issues. Does using "order" in my example
give any benefit (two cases come to my mind, see my 2nd question)?
search_def:
-----------
"includedFields": [
    {
        "fieldName": "code",
        "fieldType": "int",
        "maxValue": 255,
        "minValue": 0
    }
],
"predictedField": "code"
csv input:
----------
order,code
int,int
,
1,0
2,1
3,0
4,0
5,1
6,5
7,4
8,0
...
run model:
----------
for row in csvReader:
    col1 = row[0]  # "order", not passed to the model
    col2 = row[1]  # "code"
    result = model.run({"code": float(col2)})
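For completeness, a slightly fuller sketch of that loop, including the
rounding step I mentioned. It assumes the result object exposes
inferences["multiStepBestPredictions"][1] and inferences["anomalyScore"],
which is how I understand the hotgym anomaly example reads them.
FakeModel and Result are hypothetical stand-ins (not part of NuPIC) so
the snippet runs on its own; with a real OPF model you would pass that
instead.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for an OPF result/model, only so this runs
# without NuPIC installed.
Result = namedtuple("Result", ["inferences"])


class FakeModel(object):
    """Returns the input value plus 0.4 as the 1-step prediction."""

    def run(self, record):
        return Result({
            "multiStepBestPredictions": {1: record["code"] + 0.4},
            "anomalyScore": 0.0,
        })


def run_codes(model, csv_file, header_rows=3):
    """Feed the "code" column to the model, strictly in file order."""
    reader = csv.reader(csv_file)
    for _ in range(header_rows):  # skip names, types and flags rows
        next(reader)
    out = []
    for row in reader:
        result = model.run({"code": float(row[1])})
        predicted = result.inferences["multiStepBestPredictions"][1]
        # The prediction is a float; round it to get an event code.
        code = None if predicted is None else int(round(predicted))
        out.append((int(row[0]), code, result.inferences["anomalyScore"]))
    return out


sample = io.StringIO("order,code\nint,int\n,\n1,0\n2,1\n3,0\n")
rows = run_codes(FakeModel(), sample)
```

Each tuple in rows holds the order number, the rounded prediction, and
the anomaly score for that record.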
2. How can I parallelize model.run? I'm not sure this is even possible,
because I guess the data need to be passed to the model in
chronological order (the same order as in the CSV input), and with
parallelization this cannot be guaranteed. Or am I wrong? One possible
benefit of using "order" (my 1st question) would be synchronization
during parallelization (1 goes before 2, 2 before 3, etc.). Another
candidate for parallelization that comes to my mind is when learning is
disabled (see my 4th question) and the model is no longer being updated,
but I'm not 100% sure about either case. I haven't studied Jonathan
Mackenzie's code extensively, but I've noticed that he uses the
multiprocessing module in his traffic anomaly detection code [3]. Can
you please assist?
3. When I'm dealing with integers (rounding floats to ints) as
described above, could the anomaly score and anomaly likelihood end up
being computed incorrectly?
4. I want to do the following steps: (1) train a model on the 1st
dataset, (2) turn off learning, (3) push data from the 2nd dataset to
the model, (4) find anomalies in the 2nd dataset using the model trained
on the 1st dataset. This is very similar to what Matt did in the hotgym
anomaly tutorial (except that I will turn off learning while the 2nd
dataset is pushed to the model). Let's say that I have the following two
datasets (for simplicity):
1st dataset sequence snippet (short):
1,2,3,4,5
2nd dataset sequence snippet (long):
1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5
As you can see, the 2nd dataset is 4 times longer than the 1st. Is NuPIC
able to handle this automatically, or should I somehow create datasets
of the same size? If so, how? Please note that my input length can vary,
as I described. Please also note that using a sliding window of size 4
and averaging over the 2nd dataset would solve this particular problem,
but this was just an example.
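Just to show what I mean by the window-of-4 workaround, here is a small
sketch. I am assuming non-overlapping blocks (each value is used once),
and block_means is a name I made up. It only works because the stretch
factor in this toy example is known and constant, which is not true for
my real data.

```python
def block_means(values, size):
    """Average consecutive, non-overlapping blocks of `size` values.

    A trailing block shorter than `size` is dropped.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    means = []
    for start in range(0, len(values) - size + 1, size):
        block = values[start:start + size]
        means.append(sum(block) / float(size))
    return means


second = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
compressed = block_means(second, 4)
```

Here compressed has the same length and values as the 1st dataset
snippet, but as floats.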
5. Is it necessary to shift the 2nd dataset (see my 4th question) so
that the patterns start at the same positions as in the 1st dataset? I
will try to explain with an example (sorry for my English): suppose you
have built a model on the 1st dataset, then turned off learning, then
passed in the 2nd dataset, but the data in the 2nd dataset are shifted,
e.g. the 1st dataset started at 00:00 and the 2nd dataset started at
05:00. Will NuPIC be able to handle this automatically, or is a manual
shift needed? How can I do this kind of shift when the datasets vary in
size (see my 4th question)?
6. I am not able to capture the exact time of a particular event. If I
were able to capture the event time, would NuPIC be able to handle such
small time intervals (has anything changed since I discussed it in the
past, see [1])?
7. Is it possible to somehow combine models that were saved (pickled)? I
will try to explain what I mean: suppose you have saved a model trained
on the 1st dataset, and then saved another model trained on the 2nd
dataset. Is it possible to somehow combine those two models, e.g. take
the 1st model and do additional training based on the 2nd model, or vice
versa?
[1] Timestamp in micro or mili seconds, 16/02/29, available at:
http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2015-June/011138.html
[2] NuPIC Usage FAQ, 16/02/29, available at:
https://github.com/numenta/nupic/wiki/NuPIC-Usage-FAQ
[3] Traffic Anomaly Detection in Adelaide using HTM, 16/02/29, available
at:
https://github.com/JonnoFTW/htm-models-adelaide/blob/master/engine/index.py
Thank you