Cal Lidderdale <[EMAIL PROTECTED]> (Tuesday, September 09, 2003 12:12 PM):
> Jeremy Wadsack wrote:
>> This is what the Analog cache files are for (see
>> http://analog.cx/docs/cache.html). You can create a cache file for
>> each day's results and then write reports by reading in the cache
>> files. This will certainly improve performance. It may or may not
>> help memory problems.
>
> First, thanks for the several replies.
>
> A couple of things have happened. #1: it was ruled that Analog be
> pulled from the "production" boxes - ta-da, no more memory problems.

Ha... right. I guess I assumed you would do this first.

> So each night a cron job gzips access_log[yesterday] and ftp's it
> to a data crunch'n box. We've looked at the stats we need and are
> looking at the following DB tables:
>
>   ByHr:      date, hr, num_request (don't need pages)
>   Stat_Code: date, code, count
>   File_Type: date, type, count
>   File_Name: date, name, count

Within Analog you could use these commands to reduce the data you are
storing to a similar subset:

  HOSTLOWMEM 3
  REFLOWMEM 3
  BROWLOWMEM 3
  USERLOWMEM 3
  VHOSTLOWMEM 3

(Although the last two, depending on your site, may be empty already.)

> Then, and I'm just getting started:
>
>   select hr, sum(request) from byhr where date like '2003-09-%'
>   group by hr;
>
> or where date between '2003-04-01' and '2003-06-30' for the second
> quarter, or date = '2003-09-01' and hr = 15 :-/
>
> which returns (from my test data; each unit (+) represents 1 request):
>
>   hour:  reqs:
>   ----:  -----:
>      0   47070  +++++
>      1   35709  ++++
>      2   27145
>      3       1
>    .....
>     20   94491
>     21   87627
>     22   83777
>     23   73488
>
> Which is exactly what I'm looking for :-) and my DB file is 1224
> bytes for a 3-day test, which means I can do a whole year for under
> 20 Meg, which is 1/3 of one day's access_log.

Well, Analog's cache files use an internal format (essentially a
serialized memory dump) that will probably result in even less disk
usage, especially if you want to maintain the request-per-day
correlation in the Request Report. Although cache files track more
than just per-hour data, the *LOWMEM commands above can keep them
slimmed down. (You can also gzip cache files and Analog will
uncompress them on the fly without complaint -- improving storage
*and* performance.)

This wouldn't quite give you SQL access to the data, but it would let
you use Analog (and post-processors) to produce all the normal
reports (where you have stored data), meaning you would not have to
build your own ETL process or reporting system.

> We don't need Domain, as 100% of the data is "in-house", nor
> Organisation. Status is good for 404's, 408's and [what'da'L....]
> 503's [just saw that one, hmmmm]. File type = 5 ea. and Directory
> Report for which data is being used.

OK, I want to elaborate on the point I made before, in case it was
unclear. There is a difference between the valid data necessary to
build aggregate reports and what you get in the reports themselves.
If you use Analog to produce reports (say, XML or CSV ("COMPUTER")
output format) that you import into a database, then unless you make
sure every FLOOR command is set to '1r' (or equivalent) so that all
entries are shown, you will not be able to produce accurate data. And
if you do produce "reports" with all entries for each report, the
volume of data Analog writes in the reports is much larger than the
size of a cache file. (The processing time and memory requirements
for building the reports are much higher too.)
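To make both of those options concrete, here is a minimal sketch of
the two configurations I have in mind -- one for the nightly cache
run and one for the occasional reporting run. The file names and
paths are just placeholders for whatever your crunch'n box actually
uses, and you would list one LOGFILE / CACHEFILE line per day:

  # --- Nightly run: read one day's log and write a compact cache
  # --- file instead of full reports.
  LOGFILE /logs/access_log.2003-09-08
  HOSTLOWMEM 3
  REFLOWMEM 3
  BROWLOWMEM 3
  USERLOWMEM 3
  VHOSTLOWMEM 3
  CACHEOUTFILE /caches/2003-09-08.cache

  # --- Reporting run: read the daily caches (gzipped or not) back
  # --- in. The OUTPUT and FLOOR lines are only needed if you want
  # --- full-entry "COMPUTER" output to load into a database.
  CACHEFILE /caches/2003-09-08.cache.gz
  CACHEFILE /caches/2003-09-09.cache.gz
  OUTPUT COMPUTER
  REQFLOOR 1r
  DIRFLOOR 1r
  TYPEFLOOR 1r
  OUTFILE /reports/2003-09.dat

The first half replaces the per-day tables you described; the second
half only comes into play if you still decide you want the data in a
database rather than in Analog's own reports.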
I have worked with several clients on the web log -> database ETL
adventure. In general, we have found that custom data storage (e.g.
Analog's cache files) provides the best overall performance, because
the format is optimized for the data and the processing architecture.
When it comes to loading this data into a database, only
enterprise-class databases (e.g. Oracle) appear to be up to the task.
I would only recommend a database approach if (a) you want to
integrate other business-intelligence data into your reports, (b) you
want to use SQL as an ad-hoc query language on the data, or (c) you
have a volume of data vastly beyond the capacity of Analog but also a
hugely scalable database system that can cope with it.

Note that Analog itself (as a command-line tool) can be very
effective for ad-hoc querying and investigative analysis of web log
data, although that does require some fluency in the configuration
commands (yet another "language" to learn). Analog is well tested on
very large data sets: I have effectively used it to process logs from
a site that was generating 5GB of log files per day (with appropriate
hardware, of course).

If you do decide that database storage is necessary, take a step back
from the data and determine what *information* you will want to
investigate and report on. That way you can decide at what level of
analysis / processing the data should be stored in the DB, what can
be stored in other formats, and which existing tools can best serve
each part of your reporting.

-- 
Jeremy Wadsack
Wadsack-Allen Digital Group
