Hey everybody,

I'm a senior at N.C. State University in Raleigh, North Carolina. I have a
Senior Project this semester whose goal is to use NuPIC to detect anomalous
web server traffic. One of the goals we've been given is to identify
anomalous server response times in real time (i.e. the server took 1000
milliseconds to respond when it usually takes only 100 milliseconds).

We are encountering two problems. The first is that we've been given sample
logs containing two weeks' worth of server traffic. The logs contain fields
such as the URL requested, the HTTP method, the response code, the response
time, the size of the file requested, etc. In the sample logs we've been
given, one day typically has about 750,000 entries (roughly 10 entries per
second, give or take, if I've done the math correctly). So for two weeks of
entries, we have a lot of data to run through the model.
When we've tried to run every single entry through a NuPIC model, it takes
so long that we end up killing the process. We added some timing code to our
script, and it looks like NuPIC starts off handling 1,000 lines in about
30-40 seconds, but as more and more entries are fed into the model, the time
to process 1,000 entries climbs toward 100 seconds or so. The longest we've
run the model is about 5 hours, and in that time it made it through a little
over 5 hours of entries in the web logs. We ran the model in virtual
machines on our own laptops, with 4 fields in the model (datetime, URL
requested, HTTP method, and response time) and the prediction being for the
response time. I understand that the more fields the model includes, the
slower it will run, and that a virtual machine on a laptop is not the
quickest environment.
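To put rough numbers on it, here's the back-of-the-envelope math we've been doing (just arithmetic on the figures above):

```python
# Back-of-the-envelope throughput math using the numbers above.
entries_per_day = 750000
arrival_rate = entries_per_day / 86400.0  # entries/sec arriving at the server
total_entries = 14 * entries_per_day      # two weeks of logs
nupic_rate = 1000 / 100.0                 # entries/sec once NuPIC slows to ~100 s per 1,000
hours_to_replay = total_entries / nupic_rate / 3600.0

print(round(arrival_rate, 1))   # ~8.7 entries/sec arriving
print(total_entries)            # 10,500,000 entries in two weeks
print(round(hours_to_replay))   # ~292 hours to replay both weeks at the slow rate
```

So once the model slows to ~100 seconds per 1,000 entries, it's processing about 10 entries/sec, barely faster than the ~8.7/sec arrival rate, which is consistent with the model only getting through about as many hours of logs as we ran it for.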

So I guess all of the above is a long way of asking: should NuPIC be able to
handle, say, hundreds of entries per second in real time? Does anyone have
experience with anything like this?

Our second question is whether this is the kind of data NuPIC should be
able to identify anomalies in at all. We're trying to identify anomalies in
the response time (the model we've set up is a TemporalAnomaly model). When
we run swarms on data that includes fields such as the URL or the HTTP
method, to see if those help at all, the swarm returns None encoders for
those fields. I understand that means NuPIC determined through swarming
that those fields didn't help its model of the data, but we would have
thought that something like which URL is requested would matter, since some
are regularly faster than others. We just wanted to find out whether any of
y'all had insight into this. And then we're starting to question whether
NuPIC can really identify anomalies in the response time at all. Again,
we're telling NuPIC to build a TemporalAnomaly model, but we're starting to
think that web server response times may not follow patterns over time that
can be modeled accurately.
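For reference, here's roughly the shape of the swarm description we're using. This is a sketch from memory, not our exact file; the field names (timestamp, url, method, response_time), the file path, and the min/max bounds are just illustrative:

```python
# Approximate shape of our NuPIC swarm description (sketch, not our exact file).
# Field names, file path, and response_time bounds are illustrative.
swarm_description = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {"fieldName": "url", "fieldType": "string"},
        {"fieldName": "method", "fieldType": "string"},
        {"fieldName": "response_time", "fieldType": "float",
         "minValue": 0.0, "maxValue": 10000.0},
    ],
    "streamDef": {
        "info": "web server logs",
        "version": 1,
        "streams": [
            {"source": "file://server_logs.csv",
             "info": "two weeks of access logs",
             "columns": ["*"]},
        ],
    },
    # Anomaly detection on the response_time field.
    "inferenceType": "TemporalAnomaly",
    "inferenceArgs": {"predictedField": "response_time"},
    "swarmSize": "medium",
}
```

When the swarm comes back with None encoders for url and method, it has effectively dropped those two includedFields from the best model it found, even though we asked it to consider them.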


I know that's a lot of information to wade through and that it rambled a
bit, so if y'all have any questions or need clarification, I'd be happy to
give more details.

Thanks in advance for any help.

Daniel Rice
