Jeremy Wadsack wrote:
> Lucian Wischik ([EMAIL PROTECTED]; Saturday, March 01, 2003 3:52 AM):
>> I don't get the "cross-correlation" part. I don't want to combine
>> two reports, or do I?
> It's the memory requirement. If you have 10,000 unique requests on
> your site (not including separate query strings) and you have 16
> buckets in the processing time report, you now have to track 160,000
> unique combinations of processing-time -> request. This is even worse
> for things like host to referrer!
I don't think that's true; this isn't an O(n^2) problem.
Lucian Wischik got it right. It would require the memory necessary to store 50 URLs, a variable, and the program code, and one pass through the logfiles. However, it would also be possible to sum the execution time for each page and list the top 50 worst *average* processing times. But that is NOT what I'm talking about. That would indeed require some memory, but only proportional to the number of unique pages.
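Just to make that "average" variant concrete, here is a rough sketch in C++ with the STL. The whitespace-separated "page milliseconds" input is only a stand-in for real log parsing, and sorting the results to keep the 50 worst averages is left out:

    // Per-page accumulation: memory grows only with the number of unique pages.
    #include <iostream>
    #include <map>
    #include <string>

    int main()
    {
        std::map<std::string, std::pair<long, long>> perPage; // page -> (total ms, hits)

        std::string page; long ms;
        while (std::cin >> page >> ms) {        // stand-in for real log parsing
            perPage[page].first  += ms;
            perPage[page].second += 1;
        }
        for (const auto& p : perPage)           // average per page; top-50 sort omitted
            std::cout << p.first << " "
                      << p.second.first / p.second.second << " ms avg\n";
        return 0;
    }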
>> The original request was for a list of the top 50 worst performers. So, you
>> have a heap with 50 elements in it, each element a pair (time, name). For
>> every log entry you process, check whether its time is greater than the
>> quickest element on the heap, and if so, add it to the heap.
Exactly my intention. But you meant the slowest element on the list :-)
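For concreteness, a minimal sketch of that heap in C++ with the STL (the "ms name" input stands in for real log parsing, and 50 is just the requested report size):

    // Keep the 50 slowest requests seen so far in a min-heap, so the quickest
    // of the kept entries sits on top and is cheap to compare against and evict.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    int main()
    {
        typedef std::pair<long, std::string> Entry;   // (processing time in ms, request)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> worst;

        long ms; std::string name;
        while (std::cin >> ms >> name) {
            if (worst.size() < 50)
                worst.push(Entry(ms, name));
            else if (ms > worst.top().first) {        // slower than the quickest kept entry
                worst.pop();                          // drop the current quickest
                worst.push(Entry(ms, name));
            }
        }
        while (!worst.empty()) {                      // prints quickest first
            std::cout << worst.top().first << " ms  " << worst.top().second << "\n";
            worst.pop();
        }
        return 0;
    }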
Does "worst performer" mean the requests that had the longest processing time? Or the fewest number of requests? If it's just longest processing time, then you are right. I though we were counting requests (which when counting requires that every item be tracked, as I said before).
Processing time; IIS can log it in milliseconds.
>> Alternatively, Jeremy, you were suggesting a set of buckets. If I understand
>> right, we'd see the worst few performers in the 1-2s range, the worst few
>> performers in the 2-5s range, and so on. This'd be just the same, except
>> with one heap per bucket.
> The bucket idea is just because the current Processing Time report is handled that way.
The bucket idea would actually be more useful to me. Overall performance would improve more if I could shave 0.5 seconds off a script that is called 50,000 times a day than if I cut five seconds off a script that is called ten times a day. But a report with only one bucket would probably be easier to understand.
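(To put numbers on that: 0.5 s saved on 50,000 calls is about 25,000 seconds a day, while 5 s saved on 10 calls is only 50 seconds.)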
>> Caveat: I don't know what exactly "processing time" is. At least, my
>> logfiles don't seem to include it. If it's not explicitly stored in the log,
>> and instead has to be calculated as the time between two separate
>> requests... well, that'd involve some separate processing beforehand.
> It comes from the log files. Apache lists it in seconds (without a fractional component), so it's pretty useless on that platform. IIS lists it in milliseconds (or centiseconds on some versions?). It's the time between when the request was received and when the response was submitted.
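(For reference, and from memory, so check the docs: with Apache's mod_log_config this is the %T directive in LogFormat, whole seconds only; Apache 2 also has %D for microseconds. IIS's W3C extended format calls the field time-taken. Something like:

    LogFormat "%h %l %u %t \"%r\" %>s %b %T" common_plus_time
    CustomLog logs/access_log common_plus_time

where "common_plus_time" is just an arbitrary nickname.)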
>> There's a different, unrelated cross-correlation program I wrote which
>> annotated the "Request Report" by adding, for each request, a list of the
>> top downloaders. You'd think this'd be an n^2 problem. But just run Analog
>> once the first time to get the request report, then run it a second time,
>> except on this second run it ignores everything but the requests it's been
>> told to look out for. The computational complexity of this second run is of
>> the same order as the first run. In practice, I didn't even bother writing
>> it properly, just stuck everything naively into STL containers, and it works
>> fine up to half a million log entries. The "host->referrer" you mention
>> would be like this.
> Well, I have to admit that my Big-O notation is really rusty and I usually just think of it in orders of infinity, but isn't that still an O(n^2) problem?
> But anyway, there are two ways to approach this problem. One is to try to do it all in a single pass, in which case the memory has to be available to hold a multivariate table of both dimensions (e.g. request vs. referrers). The other way is to run two passes on the log files. If there's lots of memory and a small number of unique items, the first approach is much faster (disk access is 1,000 times slower than memory access). If there are too many unique items or not enough memory, then the second method is the only alternative.
No, it's unrelated to the number of requests. It's more like a "dynamic" filter: if the processing time is greater than a certain value, then include the page in the list of worst performers. But pages might get kicked out of the list, and the threshold to get on the list will grow.
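As for the two-pass cross-correlation above, roughly what the second pass could look like, as a sketch only (C++/STL; the "host request" input, the example URLs, and the names are made up, and picking the top downloaders per request would just be a sort of each inner map):

    // Second pass only: "watched" is whatever the first pass (the normal
    // Request Report) produced. Every other log line is skipped, so this
    // pass stays proportional to the number of log lines.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main()
    {
        std::set<std::string> watched = { "/index.html", "/download/big.zip" };

        // request -> (host -> hits), kept only for watched requests
        std::map<std::string, std::map<std::string, long>> hostsPerRequest;

        std::string host, request;
        while (std::cin >> host >> request)     // stand-in for real log parsing
            if (watched.count(request))
                ++hostsPerRequest[request][host];

        for (const auto& r : hostsPerRequest)   // print per-request host counts
            for (const auto& h : r.second)
                std::cout << r.first << "  " << h.first << "  " << h.second << "\n";
        return 0;
    }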
> If you want to use the method you propose you might as well just do it with a Perl script (or STL program or whatever). There would be very little (if any) noticeable performance gain by building this logic into Analog.
Yes, it would be easy for me to do it in Perl, but I want it included in the pretty report!