I am still struggling to reconcile page counts. Let me explain what I'm seeing...

The summary says:

  8.62M requests
  268,910 pages
  433,041 corrupt lines [which out of 8.6M lines is 5%]

Daily summary says: 268,910 pages [agrees with summary]
Hourly summary totals to: 268,910 pages [so again agrees]

Now, looking at the File Types report (and summarising it) I see:

  7.9M css, images and js file requests [92% of requests]
  395,722 [no extension] [4.6% of requests]
  201,376 [directories] [2.3% of requests]

Then a long tail of these 'misunderstood' .s=tl and similar 'file types'.

So the above 3 lines represent 98.9% of all requests, so our pages are
definitely in these numbers - or are some hiding in those 433K corrupt
lines? [is there a way to have analog spit out the lines it sees as
corrupt to examine them?]

I would consider a [no extension] or [directory] request to be
equivalent to a page. The former is a URL similar to the one I mentioned
before:

  /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m

and the latter is a directory presenting a default page (index.html for
instance).

But... 395,722 + 201,376 = 597,098 which doesn't match the 268,910 page
figure mentioned before. Also, the [directory] count is missing from the
pie chart while .gif and .jpg with lower request volumes are included?

As a further twist, the pages are tagged with a tracking bug (similar to
Google Analytics) and this gives me a page count of 460,516 for the day,
so I can't get any of the page count data to match up (and I need to
show that the httpd log analysis ties in with the tracking service).

What I *have* shown is that the *shape* of the analog *requests* data
nicely corresponds with the tracking bug page view count [different
scales, but shows I don't have time differentials shifting data], but
the analog page count data is way off the tracking service figure
(whereas the request shape is a very nice fit).

What I see is that analog's early hours page views (00:00-08:00) are
significantly higher than the bug's (8,000/hr vs. 2,000).
Maybe this is spidering going on where pages are being read but the bug
js script isn't being run, hence analog is giving a view of what's
really going on.

But then 08:00-23:25 the bug traffic levels are much higher than analog
shows (46K vs. 15K). Maybe this is proxies at work where pages are being
re-served to clients which execute the bug script and so record the page
view, but where no request reaches the web site?

Strangely the analog *requests* data closely matches the bug page views
shape (and not the analog page view shape), but maybe this is css and
other widgets, many of which are marked no-cache?

I have 304ISSUCCESS ON and so presume 304 responses will count towards
page count?

I have no STATUSINCLUDE defined and so presume all responses will be
counted by analog?

Understanding what's going on is very important as I am using this
information to work out capacity and headroom. Are we serving 46K or 15K
pages/hour? Hence if I scale up what's my max page serve rate?

Thanks for any insight into how analog is counting pages, why my
[no extension] and [directory] figures exceed my page view data, whether
I may be missing lots of pages in corrupt log lines etc.

Thx.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 19:47
To: analog-help
Subject: Re: [analog-help] Problem with page counts

2009/2/20 Iain Hunneybell <i...@ipmarketing.co.uk>:
> I am trying to analyse pages from a large 'portal' site and am having
> real problems with page counts and all attempts with PAGEINCLUDE, TYPE
> and FILEALIAS and other experiments fail.
>
> The site generates URLs similar to:
> /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
>
> It seems to be the period in the input vars that's causing the problem
> as the File Type report then lists things like:
>
>   reqs  %reqs  Gbytes  %bytes  extension
>   7277  0.08%    0.18   0.32%  .s=tl"
>  12683  0.15%    0.11   0.20%  .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment"
>   4485  0.05%    0.11   0.20%  .s=m"
>
> Note the very low percentages as this is in effect counting page by
> page as a different file type.
>

I'm not seeing this. I just tried this experiment and I see this file
listed as [no extension] which is correct. What do they look like in
your raw logfiles? For example, is the question mark encoded as %3F,
which would be a literal question mark instead of an argument separator?

> So I've tried things like:
>
> PAGEINCLUDE *.s*
> PAGEINCLUDE *.t*
>
> (with and without the trailing *).
>
> I've also tried patterns like:
>
> PAGEINCLUDE /home
>
> But all attempts fail.
>

PAGEINCLUDE /bdotg/action/home
works for me. But if my hypothesis above is correct, you might need
PAGEINCLUDE /bdotg/action/home*

The PAGEINCLUDE has nothing to do with the file types by the way
(although it's typically used that way). You can make any single file
into a "page".
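
One way to answer the "what do they look like in your raw logfiles?"
question is to tally how the query separator actually appears in the
logged request targets. The following is only a minimal sketch, not part
of analog: it assumes common/combined log format lines, the function
name classify_targets and the sample log lines are made up for
illustration, and only the /bdotg/action/home URL comes from the thread.

```python
import re
from collections import Counter

# Pulls the request target out of the quoted request field of a
# common/combined log format line, e.g. "GET /path?x=1 HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)(?: HTTP/[\d.]+)?"')


def classify_targets(lines, prefix="/bdotg/action/home"):
    """Tally how the query separator appears in targets under `prefix`.

    `prefix` defaults to the URL from the thread; the category names
    are hypothetical labels chosen for this sketch.
    """
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            counts["unparsed"] += 1          # no recognisable request field
            continue
        target = match.group(1)
        if not target.startswith(prefix):
            counts["other"] += 1             # css, images, other URLs
        elif "?" in target:
            counts["plain_separator"] += 1   # '?' logged as-is
        elif "%3f" in target.lower():
            counts["encoded_3f"] += 1        # '?' logged as %3F
        else:
            counts["no_query"] += 1          # no arguments at all
    return counts


# Invented sample lines, for illustration only.
sample = [
    '1.2.3.4 - - [20/Feb/2009:08:00:01 +0000] "GET /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m HTTP/1.1" 200 5120',
    '1.2.3.4 - - [20/Feb/2009:08:00:02 +0000] "GET /bdotg/action/home%3Fr.l1=1078549133&r.lc=en&r.s=m HTTP/1.1" 200 5120',
    '1.2.3.4 - - [20/Feb/2009:08:00:03 +0000] "GET /images/logo.gif HTTP/1.1" 304 0',
]

if __name__ == "__main__":
    print(classify_targets(sample))
```

If a real logfile shows mostly encoded_3f entries, that would be
consistent with the %3F hypothesis above; mostly plain_separator entries
would suggest looking elsewhere.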

--
Stephen Turner
+------------------------------------------------------------------------
|  TO UNSUBSCRIBE from this list:
|    http://lists.meer.net/mailman/listinfo/analog-help
|
|  Analog Documentation: http://analog.cx/docs/Readme.html
|  List archives:  http://www.analog.cx/docs/mailing.html#listarchives
|  Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------