Thanks Alex! I'm Casper, not Mark, by the way :)
Thanks for taking the time to make those plots. I certainly see the advantage of notification based on the HTM algorithm over conventional threshold monitoring, in terms of mitigating useless notifications. I'm hopeful that NuPIC can detect anomalies earlier than a person would identify a problem. I'll be looking for a quick and easy way to set up an application similar to Grok in my own environment, and I'll certainly look into using NAB to see how well it performs.

What keeps bothering me is that HTM is always learning, so the likelihood score of *recurring* anomalies will decrease each time the anomaly repeats, until it falls below the notification threshold. For an anomaly that is easily recognizable to a person this would not be a problem, but it undermines the algorithm's power to detect anomalies too subtle for people to notice. Classifying a set of desirable and/or undesirable behaviors would counteract this, but is that even possible at this point? In the presentation that Matt linked, I think Subutai mentioned (https://youtu.be/nVCKjZWYavM?t=1190) that you would have to tweak the data stream based on what you want HTM to learn from it. Does that relate to this problem?

Kind regards,

Casper Rooker
[email protected]

On Wed, Oct 14, 2015 at 7:47 PM, Alex Lavin <[email protected]> wrote:

> Hi Mark,
>
> I'd like to point you to NAB [1], our benchmark for anomaly detection in
> streaming data. Included in the corpus are 17 data files representing a
> variety of server metrics; we specifically selected these files for NAB
> because they test detectors against the problems you described.
>
> I've plotted a few examples you may be interested in [2-4], where the red
> dots represent the starting point of true anomalies, and the diamonds mark
> detections by the HTM anomaly detection algorithm (green and red are true
> and false positives, respectively).
>
> On your previous questions...
> - We typically say HTM needs 1000 data instances to sufficiently learn the
>   temporal patterns such that it can start reliably making predictions (and
>   anomaly detections). You'll notice the anomaly scores are relatively high
>   at the beginning of a data stream, but settle down after HTM has learned
>   the sequences well.
> - A very noisy stream will result in false-positive detections, but this is
>   true of any anomaly detection algorithm. To decrease the number of false
>   positives, you can increase the threshold on the anomaly likelihood. That
>   is, fewer data points will be flagged as anomalous, but this may come at
>   the cost of an increase in false negatives.
> - The temporal memory has a large capacity for storing patterns of
>   sequences, so this depends on what you mean by "prolonged use". The
>   anomaly likelihood estimation uses several parameters [5] related to how
>   much previous data is used to re-estimate the distribution, but tweaking
>   these generally has little effect on the resulting detections.
>
> [1] https://github.com/numenta/NAB
> [2] https://plot.ly/~alavin/3151/anomaly-detections-for-realawscloudwatchec2-cpu-utilization-5f5533csv/
> [3] https://plot.ly/~alavin/3187/anomaly-detections-for-realawscloudwatchelb-request-count-8c0756csv/
> [4] https://plot.ly/~alavin/3199/anomaly-detections-for-realawscloudwatchrds-cpu-utilization-e47b3bcsv/
> [5] https://github.com/numenta/nupic/blob/master/src/nupic/algorithms/anomaly_likelihood.py#L84-106
>
> Cheers,
> Alex
