Daniel,

The first thing that strikes me about your description is that you're creating only one model for all the data. Is there a way to split the data up, perhaps by URL? If there are too many URLs to create a model for each one, try to identify another way to logically sort this input data into categories with patterns that a human could understand and analyze. I think the main problem is that you're pushing too much data into one model. If you can figure out a way to split it into multiple models, you'll probably be more successful.
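To make that concrete, here's a rough sketch of the per-category routing I have in mind. The URL bucketing rule and the model class are just placeholders (you'd substitute your actual NuPIC model creation and whatever categories make sense for your logs):

```python
from urllib.parse import urlparse

def url_category(url):
    """Bucket a URL by its first path segment -- purely illustrative.
    Pick whatever grouping a human could reason about."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else "root"

class PlaceholderModel:
    """Stand-in for a real NuPIC model (you'd create one per category)."""
    def __init__(self, category):
        self.category = category
        self.records_seen = 0

    def run(self, record):
        # A real model would return a prediction / anomaly score here.
        self.records_seen += 1

models = {}  # one model per URL category

def handle_entry(record):
    """Route one log entry to the model for its URL category."""
    category = url_category(record["url"])
    if category not in models:
        models[category] = PlaceholderModel(category)
    models[category].run(record)
    return category

# handle_entry({"url": "/images/logo.png", "response_time": 87})  -> "images"
```

Besides giving each model a pattern a human could verify, this also spreads the 750,000 entries per day across several smaller models, which should help with the slowdown you're seeing.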
Let me know what you think.

Regards,

---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Fri, Feb 12, 2016 at 8:01 AM, Daniel Rice <[email protected]> wrote:

> Hey everybody,
>
> I'm a senior at N.C. State University in Raleigh, North Carolina. I have a Senior Project this semester, and its goal is to use NuPIC to detect anomalous traffic on web servers. One of the goals we've been given is to identify anomalous server response times in real time (i.e., the server took 1000 milliseconds to respond when it usually takes only 100 milliseconds).
>
> We are encountering two problems. The first is that we've been given sample logs containing two weeks' worth of server traffic. The logs contain things such as the URL requested, the HTTP method, the response code, the response time, the size of the file requested, etc. In the sample logs we've been given, one day typically has about 750,000 entries (which is about 10 entries per second, give or take, as long as I've done the math correctly). So for two weeks of entries, we have a lot of data to run through the model. When we've tried to run every single entry through a NuPIC model, it takes a really long time, and we end up killing the process. We added some timing code to our script, and it looks like NuPIC starts off handling 1,000 lines in about 30-40 seconds, but as more and more entries are fed into the model, the time to process 1,000 entries increases toward 100 seconds or so. The longest we've run the model is about 5 hours, in which time it made it through a little over 5 hours of entries in the web logs. We ran the model on virtual machines on our own laptops, with 4 fields in the model (datetime, URL requested, HTTP method, and response time) and with the prediction being for the response time.
> I understand that the more fields included in the model, the slower it will run, and that running on a virtual machine on our own laptops will not be the quickest.
>
> So I guess all of the above is a long way of describing our problem just to ask: should NuPIC be able to handle, let's say, hundreds of entries per second in real time? Do any of you have experience with anything like this?
>
> Our second question is whether this is the kind of data NuPIC should be able to identify anomalies in at all. We're trying to identify anomalies in the response time (the model we've set up is a TemporalAnomaly model), and when we run swarms on data that includes fields such as the URL or the HTTP method to see if they help at all, the swarm returns None encoders for those fields. I understand that means NuPIC determined through swarming that those fields didn't help its model of the data. But we would have thought that something like which URL is requested would matter, since some are regularly faster than others. We just wanted to find out if any of y'all had any insight into this. We're also starting to question whether NuPIC can really identify anomalies in the response time at all. Again, we're telling NuPIC to build a TemporalAnomaly model, but we're starting to think that web server response times may not be data that can be accurately represented as patterns over time.
>
> I know that's a lot of information to wade through and that it rambled all over the place, so if y'all have any questions or need clarification, I'd be happy to give more details.
>
> Thanks in advance for any help.
>
> Daniel Rice
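One quick way to probe the question raised above (whether response times carry any structure a temporal model could learn) is to compare NuPIC's anomaly output against a simple rolling-statistics baseline. The sketch below is illustrative only; the window size and threshold are made-up numbers, not anything recommended by NuPIC:

```python
from collections import deque
from math import sqrt

class RollingBaseline:
    """Naive anomaly check: flags a response time more than `k` standard
    deviations from the mean of the last `window` observations.
    A sanity check to compare against a TemporalAnomaly model's scores."""
    def __init__(self, window=1000, k=3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, response_time):
        anomalous = False
        if len(self.values) >= 30:  # wait for some history first
            n = len(self.values)
            mean = sum(self.values) / n
            var = sum((v - mean) ** 2 for v in self.values) / n
            std = sqrt(var)
            if std > 0 and abs(response_time - mean) > self.k * std:
                anomalous = True
        self.values.append(response_time)
        return anomalous
```

If a baseline this naive already flags the 1000 ms spikes against a ~100 ms norm while the NuPIC model does not, that points at the model setup (encoders, fields, data volume) rather than the data itself.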
