Before getting into your detailed questions, I would like to talk about
some high-level goals you have for HTM. You said your data stream can
contain 1700 records per second, which makes it a very fast stream of
data. There is a good chance you'll need to aggregate or subsample this
data to get something NuPIC can use. But to decide that, we need to talk
more specifically about what the data represents and what your goals are
for this endeavor. Are you trying to predict some future state, or just
looking to identify anomalies? Is the task something a human could do,
given the same data and unlimited time to analyze it?

Unless you have an extremely powerful computer, NuPIC will not be able to
process 1700 Hz data in real time, so unless you are okay with offline
processing, you will have to reduce the data somehow. In many cases you
don't actually need that much data; it depends on your goals and on the
pattern frequencies within the data. Are there actually sub-second
patterns? Or do the patterns occur over seconds, tens of seconds, or
minutes? These are important questions to answer before we get into
implementation details, and the answers will help us decide how to
reduce the data.
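For example, here is a rough sketch of one possible aggregation strategy: a majority vote over fixed-size windows. The window size, the `aggregate_codes` helper, and the majority-vote rule are all my own illustrative choices here, not anything NuPIC prescribes; the right reduction depends on your pattern frequencies.

```python
from collections import Counter

def aggregate_codes(codes, window=170):
    """Downsample a high-rate stream of integer event codes by keeping
    the most frequent code in each fixed-size window (~0.1 s at 1700 Hz).

    This is an illustrative sketch, not a NuPIC API.
    """
    aggregated = []
    for start in range(0, len(codes), window):
        chunk = codes[start:start + window]
        if chunk:
            # most_common(1) returns [(code, count)] for the mode.
            aggregated.append(Counter(chunk).most_common(1)[0][0])
    return aggregated

# One second of 1700 Hz data collapses to 10 records.
stream = [1, 2, 3, 4, 5] * 340  # 1700 events
reduced = aggregate_codes(stream, window=170)
print(len(reduced))
```

With a 170-sample window, each second of 1700 Hz data becomes 10 records, which is much closer to a rate NuPIC handles comfortably. Whether a majority vote, a mean, or something else is appropriate depends on what the codes actually represent.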

Thanks,

---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Mon, Feb 29, 2016 at 3:46 PM, Wakan Tanka <[email protected]> wrote:

> Hello NuPIC,
> I have a dataset which contains event codes produced by a specialized
> system. In one minute you can get anywhere from roughly 100k to 250k
> events, depending on what the system is doing, how busy it is, how it is
> utilized, etc. Every event has its own integer code, so a sequence looks
> like this (semicolon separated): 1;2;3;4;5;1;2;2;4;5 ... I know that
> NuPIC isn't primarily designed to work with integers, so after getting a
> prediction out of the OPF I round the float to get an integer. As I said,
> you can get more than 100k events per 60 seconds (roughly 1700 events per
> second), which means that if I wanted to encode timestamps for this data
> I would need millisecond support in NuPIC. I discussed this a while ago
> here on the mailing list [1] and the conclusion was to not use the
> DateEncoder (simply omit the timestamp field).
>
> QUESTIONS:
> ##########
> 1. I get predictions and an anomaly score from NuPIC using the code below
> (only the important parts are shown). As you can see, I have two columns
> (fields) in the CSV input file: "order" and "code", but I'm only using the
> "code" field in the swarming process and when running the model. In my
> example "order" is just what its name says: numbers from 1 to n. I've read
> on the wiki [2] that it is advised to include only a few fields in the
> model due to performance issues. Does using "order" in my example give
> some benefit (there are two cases that come to mind; see my 2nd question)?
>
> search_def:
> -----------
>   "includedFields": [
>     {
>       "fieldName": "code",
>       "fieldType": "int",
>       "maxValue": 255,
>       "minValue": 0
>     }
>   ],
>
>    "predictedField": "code"
>
> csv input:
> ----------
> order,code
> int,int
> ,
> 1,0
> 2,1
> 3,0
> 4,0
> 5,1
> 6,5
> 7,4
> 8,0
> ...
>
>
> run model:
> ----------
> for row in csvReader:
>     order = row[0]  # currently unused; see question 1
>     code = row[1]
>     result = model.run({"code": float(code)})
>
>
> 2. How can I parallelize model.run? I'm not sure this is even possible,
> because I guess the data needs to be passed to the model in chronological
> order (the same order as in the CSV input), and with parallelization this
> ordering cannot be guaranteed, or am I wrong? One possible benefit of
> using "order" (my 1st question) could be synchronization during
> parallelization (1 goes before 2, 2 before 3, etc.). Another candidate for
> parallelization that comes to mind is when learning is disabled (see my
> 4th question) and the model is no longer updating, but I'm not 100% sure
> about either case. I haven't studied Jonathan Mackenzie's code
> extensively, but I noticed that he uses the multiprocessing module in his
> traffic anomaly detection code [3]. Can you please assist?
>
>
> 3. When I'm dealing with integers (rounding floats to ints) as described
> above, aren't the anomaly score and anomaly likelihood computed somewhat
> incorrectly?
>
>
> 4. I want to do the following steps: 1. train a model on a 1st dataset,
> 2. turn off learning, 3. push data from a 2nd dataset to the model,
> 4. find anomalies in this 2nd dataset using the model trained on the 1st
> dataset. This is very similar to what Matt did in the hotgym anomaly
> tutorial (except that I will turn off learning when the 2nd dataset is
> pushed to the model). Let's say I have the following two datasets (for
> simplicity):
>
> 1st dataset sequence snippet (short):
> 1,2,3,4,5
>
>
> 2nd dataset sequence snippet (long)
> 1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5
>
> As you can see, the 2nd dataset is 4 times longer than the 1st. Is NuPIC
> able to handle this automatically, or should I somehow create datasets of
> the same size, and if so, how? Please note that my input length can vary,
> as I described. Please also note that using a sliding window of size 4
> and averaging the 2nd dataset would solve this particular problem, but
> this was just an example.
>
>
> 5. Is it necessary to have the 2nd dataset (see my 4th question) shifted
> so that the patterns start at the same positions as in the 1st dataset?
> I will try to explain with an example (sorry for my English): suppose
> you've built a model on the 1st dataset, then turned off learning, then
> passed in the 2nd dataset, but the data in the 2nd dataset is shifted,
> e.g. if the 1st dataset started at 00:00, the 2nd dataset started at
> 05:00. Will NuPIC be able to handle this automatically, or is a manual
> shift needed? How would I do this kind of shift when the datasets vary in
> size (see my 4th question)?
>
>
> 6. I am not able to capture the exact time of a particular event. If I
> were able to capture event times, would NuPIC be able to handle such
> small time intervals (has anything changed since I discussed this in the
> past; see my 1st question)?
>
>
> 7. Is it possible to somehow combine models that were saved (pickled)?
> I will try to explain what I mean: suppose you have a saved model trained
> on the 1st dataset, and another saved model trained on the 2nd dataset.
> Is it possible to combine those two models, e.g. take the 1st model and
> do additional training based on the 2nd model, or vice versa?
>
>
> [1] Timestamp in micro or mili seconds, 16/02/29, available at:
> http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2015-June/011138.html
> [2] NuPIC Usage FAQ, 16/02/29, available at:
> https://github.com/numenta/nupic/wiki/NuPIC-Usage-FAQ
> [3] Traffic Anomaly Detection in Adelaide using HTM, 16/02/29, available
> at:
> https://github.com/JonnoFTW/htm-models-adelaide/blob/master/engine/index.py
>
>
>
> Thank you
>
>