I am still struggling to reconcile page counts. Let me explain what I'm
seeing...

The summary says:
        8.62M requests
        268,910 pages
        433,041 corrupt lines [which out of 8.6M lines is 5%]

Daily summary says:
        268,910 pages [agrees with summary]

Hourly summary totals to:
        268,910 pages [so again agrees]

Now, looking at the File Types report (and summarising it) I see:
        7.9M css, images and js file requests  [92% of requests]
        395,722 [no extension]                  [4.6% of requests]
        201,376 [directories]                   [2.3% of requests]
        Then a long tail of these 'misunderstood' .s=tl and similar 'file types'

So those three lines account for 98.9% of all requests, so our pages must be
somewhere in those numbers - or are some hiding in those 433K corrupt lines?
[Is there a way to have analog output the lines it treats as corrupt so I can
examine them?]
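
In the meantime I've been thinking of fishing out malformed lines myself with
something like the Python below. This is only my guess at what "corrupt" might
mean (a line that doesn't match the Combined Log Format), so it may well not
agree with analog's own test, and the access.log path is made up:

import re

# Rough Combined Log Format check:
# host ident user [date] "request" status bytes "referrer" "user-agent"
CLF = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-)'
    r'(?: "[^"]*" "[^"]*")?\s*$'
)

with open("access.log") as log:              # hypothetical path
    for lineno, line in enumerate(log, 1):
        if not CLF.match(line):
            print(f"{lineno}: {line.rstrip()}")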

I would consider a [no extension] or [directory] request to be equivalent to a
page. A [no extension] example is a URL like the one I mentioned before:
        /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
and a [directory] is one presenting a default page (index.html, for instance). But...

395,722 + 201,376 = 597,098, which doesn't match the 268,910 page figure
mentioned before. Also, the [directories] count is missing from the pie chart,
while .gif and .jpg, with lower request volumes, are included - why?
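
For what it's worth, this is how I've been assuming the File Type report carves
requests up (drop the query string, then look at the last path component), so
that I can tally [no extension] and [directories] myself from the raw log. It's
my assumption about analog's logic rather than anything from the docs, and the
log path is invented:

from collections import Counter
from urllib.parse import urlsplit

counts = Counter()

with open("access.log") as log:                    # hypothetical path
    for line in log:
        try:
            request = line.split('"')[1]           # e.g. GET /path?query HTTP/1.1
            url = request.split()[1]
        except IndexError:
            counts["unparseable"] += 1
            continue
        path = urlsplit(url).path                  # strip the ?query part
        last = path.rsplit("/", 1)[-1]
        if path.endswith("/"):
            counts["[directories]"] += 1
        elif "." in last:
            counts["." + last.rsplit(".", 1)[-1]] += 1
        else:
            counts["[no extension]"] += 1

for ext, n in counts.most_common(10):
    print(f"{n:>10}  {ext}")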

As a further twist, the pages are tagged with a tracking bug (similar to
Google Analytics) which gives me a page count of 460,516 for the day, so I
can't get any of the page count data to match up (and I need to show that the
httpd log analysis ties in with the tracking service).

What I *have* shown is that the *shape* of the analog *requests* data
corresponds nicely with the tracking bug page view counts [different scales,
but it shows I don't have time differentials shifting the data], whereas the
analog *page* count data is way off the tracking service figure (even though
the request shape is a very good fit).
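
(The shape comparison itself was nothing clever: I normalised both hourly
series to their own daily totals and put them side by side, along the lines of
the sketch below, where the two CSV files are just hypothetical hourly exports
of my data.)

import csv

def hourly(path):
    """Read 'hour,count' rows into a list of 24 hourly counts."""
    series = [0] * 24
    with open(path, newline="") as f:
        for hour, count in csv.reader(f):
            series[int(hour)] = int(count)
    return series

analog_reqs = hourly("analog_hourly_requests.csv")   # hypothetical export
bug_views   = hourly("bug_hourly_pageviews.csv")     # hypothetical export

# Normalise each series to its own daily total so the shapes can be compared
for h in range(24):
    a = analog_reqs[h] / sum(analog_reqs)
    b = bug_views[h] / sum(bug_views)
    print(f"{h:02d}:00  analog {a:6.1%}   bug {b:6.1%}")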

What I see is that in the early hours (00:00-08:00) analog's page views are
significantly higher than the bug's (8,000/hr vs. 2,000). Maybe this is
spidering, where pages are being fetched but the bug's JS isn't being
executed, so analog is giving a view of what's really going on.
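
One way I thought I could test the spidering idea is to tally user agents for
the overnight window straight from the raw log and look for obvious robots. A
rough sketch, with the bot substrings just common examples rather than a
definitive list and the log path made up:

from collections import Counter

BOT_HINTS = ("bot", "spider", "crawl", "slurp")    # common examples only
agents = Counter()

with open("access.log") as log:                    # hypothetical path
    for line in log:
        parts = line.split('"')
        if len(parts) < 6 or "[" not in line:
            continue                               # no user-agent field / malformed
        hour = line.split("[", 1)[1][12:14]        # e.g. [20/Feb/2009:03:15:02 ...
        if "00" <= hour <= "07":                   # the 00:00-08:00 window
            agents[parts[5]] += 1

for agent, n in agents.most_common(20):
    flag = "BOT?" if any(h in agent.lower() for h in BOT_HINTS) else "    "
    print(f"{n:>8}  {flag}  {agent[:60]}")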

But then from 08:00-23:25 the bug traffic levels are much higher than analog
shows (46K vs. 15K). Maybe this is proxies at work: pages are re-served to
clients from a cache, those clients execute the bug script and so record the
page view, but no request ever reaches the web site. Strangely, the analog
*requests* data closely matches the shape of the bug page views (and not the
shape of the analog page views) - but maybe that's down to css and other
widgets, many of which are marked no-cache?

I have 304ISSUCCESS ON, so I presume 304 responses will count towards the page
count? And I have no STATUSINCLUDE defined, so I presume all responses will be
counted by analog?
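
To see how much the 304s matter I was planning to tally status codes for the
page-like URLs straight from the log, something like the sketch below (log
path invented, and /bdotg/action/ used as my stand-in for "page" URLs):

from collections import Counter

statuses = Counter()

with open("access.log") as log:                    # hypothetical path
    for line in log:
        parts = line.split('"')
        if len(parts) < 3:
            continue
        request = parts[1]                         # e.g. GET /path?query HTTP/1.1
        after = parts[2].split()                   # status and bytes follow the request
        if len(request.split()) < 2 or not after:
            continue
        url = request.split()[1]
        if url.startswith("/bdotg/action/"):       # stand-in for "page" URLs
            statuses[after[0]] += 1

print(statuses.most_common())
# e.g. [('200', ...), ('304', ...)] shows how many "page" requests were 304s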

Understanding what's going on is very important as I am using this
information to work out capacity and headroom. Are we serving 46K or 15K
pages/hour? Hence if I scale up what's my max page serve rate?

Thanks for any insight into how analog is counting pages, why my [no
extension] and [directories] figures exceed my page view data, whether I may
be missing lots of pages in corrupt log lines, etc.

Thx.../Iain

 

-----Original Message-----
From: analog-help-boun...@lists.meer.net
[mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 19:47
To: analog-help
Subject: Re: [analog-help] Problem with page counts

2009/2/20 Iain Hunneybell <i...@ipmarketing.co.uk>:
> I am trying to analyse pages from a large 'portal' site and am having 
> real problems with page counts and all attempts with PAGEINCLUDE, TYPE 
> and FILEALIAS and other experiments fail.
>
> The site generates URLs similar to:
> /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
>
> It seems to be the period in the input vars that's causing the problem 
> as the File Type report then lists things like:
>
> reqs    %reqs   Gbytes  %bytes  extension
> 7277    0.08%   0.18    0.32%   .s=tl"
> 12683   0.15%   0.11    0.20%   .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment"
> 4485    0.05%   0.11    0.20%   .s=m"
>
> Note the very low percentages as this is in effect counting page by 
> page as a different file type.
>

I'm not seeing this. I just tried this experiment and I see this file listed
as [no extension] which is correct. What do they look like in your raw
logfiles? For example, is the question mark encoded as %3F, which would be a
literal question mark instead of an argument separator?

> So I've tried things like:
>
> PAGEINCLUDE *.s*
> PAGEINCLUDE *.t*
>
> (with and without the trailing *).
>
> I've also tried patterns like:
>
> PAGEINCLUDE /home
>
> But all attempts fail.
>

PAGEINCLUDE /bdotg/action/home

works for me. But if my hypothesis above is correct, you might need

PAGEINCLUDE /bdotg/action/home*

The PAGEINCLUDE has nothing to do with the file types by the way (although
it's typically used that way). You can make any single file into a "page".

--
Stephen Turner




