Cal Lidderdale <[EMAIL PROTECTED]> (Tuesday, September 09, 2003 12:12 PM):
> Jeremy Wadsack wrote:
>> This is what the Analog cache files are for (see
>> http://analog.cx/docs/cache.html). You can create a cache file for
>> each day's results and then write reports by reading in the cache
>> files. This will certainly improve performance. It may or may not
>> help memory problems.
>
> First, thanks for the several replies.
>
> A couple of things have happened. #1: it was ruled that Analog be
> pulled from the "production" boxes - ta-da, no more memory problems.

Ha... right. I guess I assumed you would do this first.

> So each night a cron job gzips access_log[yesterday] and ftp's it
> to a data crunch'n box. We've looked at the stats we need and are
> looking at the following DB tables:
>
>   ByHr:      date, hr, num_request (don't need pages)
>   Stat_Code: date, code, count
>   File_Type: date, type, count
>   File_Name: date, name, count

Within Analog you could use these commands to reduce the data you are
storing to a similar subset:

  HOSTLOWMEM 3
  REFLOWMEM 3
  BROWLOWMEM 3
  USERLOWMEM 3
  VHOSTLOWMEM 3

(Although the last two, depending on your site, may be empty already.)

> Then, and I'm just getting started:
>
>   select hr, sum(request) from byhr where date like '2003-09-%'
>   group by hr;
>
> or where date between '2003-04-01' and '2003-06-30' for the second
> quarter, or date = '2003-09-01' and hr = 15 :-/
>
> which returns (from my test data; each unit (+) represents 1 request):
>
>   hour:  reqs:
>   ----:  -----:
>      0   47070  +++++
>      1   35709  ++++
>      2   27145
>      3       1
>    .....
>     20   94491
>     21   87627
>     22   83777
>     23   73488
>
> Which is exactly what I'm looking for :-) and my DB file is 1224
> bytes for a 3-day test, which means I can do a whole year for under
> 20 Meg, which is 1/3 of one day's access_log.

Well, Analog's cache files use an internal format (essentially a
serialized memory dump) that will probably result in even less disk
usage, especially if you want to maintain the request-per-day
correlation in the Request Report. Although cache files track more
than just per-hour data, the *LOWMEM commands above can keep them
slimmed down. (You can also gzip cache files and Analog will
uncompress them on the fly without complaint -- improving storage
*and* performance.)

This wouldn't quite give you SQL access to the data, but it would let
you use Analog (and post-processors) to produce all the normal
reports (where you have stored data), meaning you would not have to
build your own ETL process or reporting system.

> We don't need Domain, as 100% of the data is "in-house", nor
> Organisation. Status is good for 404's, 408's and [what'da'L....]
> 503's [just saw that one, hmmmm]. File type = 5 ea. and Directory
> Report for which data is being used.

OK, I want to elaborate on the point I made before, in case it was
unclear. There is a difference between the valid data necessary to
build aggregate reports and what you get in the reports themselves.
If you use Analog to produce reports (say, XML or CSV ("COMPUTER")
output format) that you import into a database, then unless you make
sure every FLOOR command is set to '1r' (or equivalent) so that all
entries are shown, you will not be able to produce accurate data. And
if you do produce "reports" with all entries for each report, the
volume of data Analog writes in the reports is much larger than the
size of a cache file. (The processing time and memory requirements
for building the reports are much higher too.)
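To make both of those options concrete, here is a minimal sketch of
the two configurations I have in mind -- one for the nightly cache
run and one for the occasional reporting run. The file names and
paths are just placeholders for whatever your crunch'n box actually
uses, and you would list one LOGFILE / CACHEFILE line per day:

  # --- Nightly run: read one day's log and write a compact cache
  # --- file instead of full reports.
  LOGFILE /logs/access_log.2003-09-08
  HOSTLOWMEM 3
  REFLOWMEM 3
  BROWLOWMEM 3
  USERLOWMEM 3
  VHOSTLOWMEM 3
  CACHEOUTFILE /caches/2003-09-08.cache

  # --- Reporting run: read the daily caches (gzipped or not) back
  # --- in. The OUTPUT and FLOOR lines are only needed if you want
  # --- full-entry "COMPUTER" output to load into a database.
  CACHEFILE /caches/2003-09-08.cache.gz
  CACHEFILE /caches/2003-09-09.cache.gz
  OUTPUT COMPUTER
  REQFLOOR 1r
  DIRFLOOR 1r
  TYPEFLOOR 1r
  OUTFILE /reports/2003-09.dat

The first half replaces the per-day tables you described; the second
half only comes into play if you still decide you want the data in a
database rather than in Analog's own reports.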
I have worked with several clients on the web log -> database ETL
adventure. In general, we have found that custom data storage (e.g.
Analog's cache files) provides the best overall performance, because
the format is optimized for the data and the processing architecture.
When it comes to loading this data into a database, only
enterprise-class databases (e.g. Oracle) appear to be up to the task.
I would only recommend a database approach if (a) you want to
integrate other business-intelligence data into your reports, (b) you
want to use SQL as an ad-hoc query language on the data, or (c) you
have a volume of data vastly beyond the capacity of Analog but also a
hugely scalable database system that can cope with it.

Note that Analog itself (as a command-line tool) can be very
effective for ad-hoc querying and investigative analysis of web log
data, although that does require some fluency in the configuration
commands (yet another "language" to learn). Analog is well tested on
very large data sets: I have effectively used it to process logs from
a site that was generating 5GB of log files per day (with appropriate
hardware, of course).

If you do decide that database storage is necessary, take a step back
from the data and determine what *information* you will want to
investigate and report on. That way you can decide at what level of
analysis / processing the data should be stored in the DB, what can
be stored in other formats, and which existing tools can best serve
each part of your reporting.

-- 
Jeremy Wadsack
Wadsack-Allen Digital Group
