Hey everybody, I'm a senior at N.C. State University in Raleigh, North Carolina. I have a Senior Project this semester whose goal is to use NuPIC to detect anomalous web server traffic. One of the goals we've been given is to identify anomalous server response times in real time (e.g. the server took 1,000 milliseconds to respond when it usually takes only 100 milliseconds).
We are running into two problems. First, we've been given sample logs containing two weeks' worth of server traffic. The logs contain fields such as the URL requested, the HTTP method, the response code, the response time, the size of the file requested, etc. In these sample logs, one day typically has about 750,000 entries, which works out to roughly 9 entries per second (750,000 / 86,400), so two weeks gives us a lot of data to run through the model.

When we've tried to run every single entry through a NuPIC model, it takes so long that we end up killing the process. We added some timing code to our script, and NuPIC starts off handling 1,000 lines in about 30-40 seconds, but as more entries go into the model, the time per 1,000 entries climbs toward 100 seconds or so. The longest we've run the model is about 5 hours, in which time it made it through a little over 5 hours' worth of log entries. We ran the model in virtual machines on our own laptops, with 4 fields in the model (datetime, URL requested, HTTP method, and response time), with the prediction being the response time. I understand that the more fields in the model, the slower it runs, and that a VM on a laptop won't be the quickest setup. So all of the above is a long way of asking: should NuPIC be able to handle, say, hundreds of entries per second in real time? Do any of you have experience with anything like this? Our second question is whether this is the kind of data NuPIC should be able to identify anomalies in at all.
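In case it helps diagnose the slowdown, here's roughly what our timing code looks like, reduced to a minimal sketch. The `process_record` function here is just a stand-in for the actual `model.run()` call (the real model is what we're timing), and the helper names are our own:

```python
import time

def process_record(record):
    # Stand-in for model.run(record); replace with the real NuPIC call.
    return sum(record.values())

def run_with_timing(records, batch_size=1000):
    """Feed records one at a time, recording seconds per batch_size rows."""
    timings = []
    start = time.perf_counter()
    for i, record in enumerate(records, 1):
        process_record(record)
        if i % batch_size == 0:
            now = time.perf_counter()
            timings.append(now - start)
            start = now
    return timings
```

If the per-batch times in `timings` keep growing as the model sees more data, that matches the slowdown we're describing (30-40 s per 1,000 rows early on, creeping toward 100 s).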
We're trying to identify anomalies in the response time (the model we've set up is a TemporalAnomaly model). When we run swarms on data that includes fields such as the URL or the HTTP method to see if those help at all, the swarm comes back with None encoders for those fields. I understand that means NuPIC determined through swarming that those fields didn't help its model of the data, but we would have thought that which URL is requested would matter, since some are regularly faster than others. We just wanted to find out if any of y'all have insight into this.

We're also starting to question whether NuPIC can really identify anomalies in the response time at all. Again, we're telling NuPIC to build a TemporalAnomaly model, but we're beginning to think that web server response times may not follow patterns over time that can be modeled accurately. I know that's a lot of information to wade through and it rambles a bit, so if y'all have any questions or need clarification, I'd be happy to give more details. Thanks in advance for any help.

Daniel Rice
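P.S. For reference, the encoders section of the model params the swarm hands back looks roughly like the sketch below. The field names are from our logs, but the specific encoder parameter values here are illustrative, not our actual swarm output; the point is the None entries:

```python
# Illustrative sketch of the "encoders" section of swarm-generated model
# params. Field names match our logs; numeric values are hypothetical.
ENCODERS = {
    "timestamp_timeOfDay": {
        "fieldname": "timestamp",
        "name": "timestamp_timeOfDay",
        "type": "DateEncoder",
        "timeOfDay": (21, 1),
    },
    "response_time": {
        "fieldname": "response_time",
        "name": "response_time",
        "type": "ScalarEncoder",
        "n": 100,
        "w": 21,
        "minval": 0,
        "maxval": 2000,
        "clipInput": True,
    },
    # The swarm set these to None, i.e. it dropped the fields entirely:
    "url": None,
    "http_method": None,
}
```

So the model the swarm chose only actually encodes the timestamp and the response time, which is what prompted our question about whether the URL and method should matter.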
