RE: [analog-help] Problem with page counts
Well, Cygwin is a big help, thanks... Only it now raises more questions!

One odd thing is that analog is reporting quite high usage of Netscape 4, which caused me to look further. analog says:

    5  172430   2.00%  Netscape
       170370   1.98%    Netscape/4
       167023   1.94%      Netscape/4.06
         3281   0.04%      Netscape/4.0
           41              Netscape/4.77
            3              Netscape/4.5
           16              Netscape/4.76
            2              Netscape/4.61
            1              Netscape/4.05
            3              Netscape/4.7
         1645   0.02%    Netscape/7
         1643   0.02%      Netscape/7.2
            2              Netscape/7.1
          414            Netscape/8
          371              Netscape/8.1
           43              Netscape/8.1.3

Most of it seems to be Netscape 4.06, which would indeed be old. So I tried:

    grep 'Netscape/' *.log > netscape.log

I then used Excel to summarise netscape.log and came up with:

    user-agent        Total
    Netscape/7.1          2   [matches analog]
    Netscape/7.2       1695   [analog says 1643]
    Netscape/8.0.4        5   [missing from analog]
    Netscape/8.1        387   [analog says 371]
    Netscape/8.1.3       43   [matches analog]
    Grand Total        2132   [way off, as analog sees lots of Netscape/4 traffic]

grep does not find any 'Netscape/4' strings at all. Note that some counts correspond: Netscape/8.1.3 is 43 under both counts and Netscape/7.1 is 2 under both counts.

Is there user-agent signature mapping going on within analog that relates some string[s] other than 'Netscape/4' to Netscape v4 user agents? These figures will be used to derive browser compatibility tests, so I'll be challenged on my Netscape 4 figures and want to be certain :-)

Thx.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 20:53
To: Support for analog web log analyzer
Subject: Re: [analog-help] Problem with page counts

+------------------------------------------------------------------------
|  TO UNSUBSCRIBE from this list:
|    http://lists.meer.net/mailman/listinfo/analog-help
|
|  Analog Documentation: http://analog.cx/docs/Readme.html
|  List archives:  http://www.analog.cx/docs/mailing.html#listarchives
|  Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------
RE: [analog-help] Problem with page counts
I am still struggling to reconcile page counts. Let me explain what I'm seeing...

The summary says:

    8.62M requests
    268,910 pages
    433,041 corrupt lines [which out of 8.6M lines is 5%]

The daily summary says 268,910 pages [agrees with summary], and the hourly summary totals to 268,910 pages [so again agrees].

Now, looking at the File Types report (and summarising it) I see:

    7.9M     css, images and js file requests  [92% of requests]
    395,722  [no extension]                    [4.6% of requests]
    201,376  [directories]                     [2.3% of requests]

followed by a long tail of these 'misunderstood' .s=tl and similar 'file types'.

The above 3 lines represent 98.9% of all requests, so our pages are definitely in these numbers. Or are some hiding in those 433K corrupt lines? [Is there a way to have analog spit out the lines it sees as corrupt, so I can examine them?]

I would consider a [no extension] or [directories] request to be equivalent to a page. The former is a URL similar to the one I mentioned before:

    /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m

and the latter is a directory presenting a default page (index.html, for instance). But 395,722 + 201,376 = 597,098, which doesn't match the 268,910 page figure mentioned above. Also, the [directories] count is missing from the pie chart, while .gif and .jpg with lower request volumes are included?

As a further twist, the pages are tagged with a tracking bug (similar to Google Analytics) and this gives me a page count of 460,516 for the day, so I can't get any of the page count data to match up (and I need to show that the httpd log analysis ties in with the tracking service).

What I *have* shown is that the *shape* of the analog *requests* data corresponds nicely with the tracking bug page view count [different scales, but it shows I don't have time differentials shifting data]. The analog *page* count data, however, is way off the tracking service figure.

What I see is that analog's early-hours page views (00:00-08:00) are significantly higher than the bug's (8,000/hr vs. 2,000). Maybe this is spidering, where pages are being read but the bug's js script isn't being run, in which case analog is giving a view of what's really going on. But then from 08:00-23:25 the bug traffic levels are much higher than analog shows (46K vs. 15K). Maybe this is proxies at work, where pages are being re-served to clients which execute the bug script and so record the page view, but where no request reaches the web site?

Strangely, the analog *requests* data closely matches the shape of the bug page views (and not the analog page view shape), but maybe this is css and other widgets, many of which are marked no-cache?

I have 304ISSUCCESS ON and so presume 304 responses will count towards the page count? I have no STATUSINCLUDE defined and so presume all responses will be counted by analog?

Understanding what's going on is very important as I am using this information to work out capacity and headroom. Are we serving 46K or 15K pages/hour? And hence, if I scale up, what's my max page serve rate?

Thanks for any insight into how analog is counting pages, why my [no extension] and [directories] figures exceed my page view data, whether I may be missing lots of pages in corrupt log lines, etc.

Thx.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 19:47
To: analog-help
Subject: Re: [analog-help] Problem with page counts
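The arithmetic in the message above can be replayed in a few lines. This is only a sketch: the variable names are made up for illustration, and the 8.62M total is the rounded figure from the summary, so the percentages are approximate.

```python
# Figures quoted in the message above; the names are illustrative only.
total_requests = 8_620_000   # "8.62M requests" (rounded in the report)
corrupt_lines = 433_041
analog_pages = 268_910       # analog's page count
no_extension = 395_722       # File Types report: [no extension]
directories = 201_376        # File Types report: [directories]
tag_pages = 460_516          # page views recorded by the tracking bug


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Requests that "look like" pages if every extensionless URL and
# every directory request were counted as a page.
candidate_pages = no_extension + directories

print("candidate pages:", candidate_pages)                          # 597098
print("corrupt share %:", pct(corrupt_lines, total_requests))       # 5.0
print("no-extension share %:", pct(no_extension, total_requests))   # 4.6
print("directory share %:", pct(directories, total_requests))       # 2.3
print("excess over analog pages:", candidate_pages - analog_pages)
print("tag pages / analog pages:", round(tag_pages / analog_pages, 2))
```

None of this says which figure is right; it only makes the size of each gap explicit before looking for causes.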
Re: [analog-help] Problem with page counts
OK, there are lots of things here, but the first important thing to say is that logfile analysis and page tagging will never match up. They use fundamentally different techniques, and each makes errors that the other is not susceptible to.

For page views you would normally expect to see the logfile analysis numbers lower, because page tagging will see the page again if the visitor returns to it, but logfile analysis won't.

You do have too many corrupt lines. If you turn debugging on, you will see all the corrupt lines, and where in the line they were corrupt.

It looks like you have about 100,000 of these strange .s=tl lines, right? Page tagging may be including them as pages, depending what they really are and whether they are tagged, so it may be worth tracking them down in the logfiles.

Sorry, no great insights, but at least that might give you some avenues to look down.

-- 
Stephen Turner
RE: [analog-help] Problem with page counts
Well, to answer my own question... My Netscape/4.06 traffic appears in the logs with a user-agent string of 'Mozilla/4.06', so there is a 'transcription' being done by analog... but its results seem correct :-)

But just to show where log analysis can take you: looking at the requests, I see they come from private address space, so it seems something on an internal network is generating them. Now the task is to find out what and why!

As for page counts, my best rationalisation is that the high overnight count is spiders, so analog is correctly showing page requests that aren't being recorded by the page view 'bug'. Then during the day, proxies are causing the page bug to record higher page views than are seen by the servers. It's the best rationalisation I've come up with so far!

I've had analog dump out the log lines it sees as corrupt and they do indeed seem to be truncated. They account for about 5% of the log lines, which seems high. Now to understand why the servers would be doing this!

.../Iain

-----Original Message-----
Sent: 21 February 2009 14:03
To: 'Support for analog web log analyzer'
Subject: RE: [analog-help] Problem with page counts
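One way to follow up on the 'Mozilla/4.06' finding is to list which clients send a bare Mozilla/4.x user agent and whether they sit in private address space. The sketch below makes several assumptions that are not from the thread: the logs are in NCSA combined format, the "no 'compatible' token" test is a rough heuristic (it is not analog's actual mapping rule), and the sample lines are invented purely for illustration.

```python
import ipaddress
import re
from collections import Counter

# Assumes NCSA "combined" log format; adjust if the real logs differ.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def old_netscape_clients(lines):
    """Count (host, agent) pairs whose agent looks like a bare Netscape 4."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # corrupt or differently formatted line
        agent = m.group("agent")
        # Heuristic only: Netscape 4 sent "Mozilla/4.x" without "compatible",
        # whereas MSIE and others sent "Mozilla/4.0 (compatible; ...)".
        if agent.startswith("Mozilla/4.") and "compatible" not in agent:
            hits[(m.group("host"), agent)] += 1
    return hits


def is_private(host):
    """True if host is an IP literal in private or otherwise reserved space."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False  # a hostname, not an IP address


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '10.1.2.3 - - [21/Feb/2009:02:00:00 +0000] "GET /bdotg/action/home HTTP/1.0" 200 512 "-" "Mozilla/4.06"',
    '8.8.8.8 - - [21/Feb/2009:02:00:01 +0000] "GET /bdotg/action/home HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"',
]

for (host, agent), count in sorted(old_netscape_clients(SAMPLE_LINES).items()):
    print(host, "private" if is_private(host) else "public", agent, count)
```

On real logs the function can be fed an open file object instead of the sample list, since it only iterates over lines.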
RE: [analog-help] Problem with page counts
Many thanks for this, Stephen.

Yes, there do seem to be an unusually high number of corrupt log lines. The problem seems to be truncation. I don't yet know (I haven't had time to check) whether they all terminate at a specific length... possibly so... so maybe it's a server config issue, with long URIs causing the log lines to overflow and be truncated, rendering them useless. So it's probably fair to guess that a lot of these lines are page reads with long associated URIs that have been truncated. Hence I'm losing page reads.

Re browser activity and caching: with '304ISSUCCESS ON' I presume a GET request with a 304 will be counted as a page read? Of course, if the browser (or an intermediate proxy) doesn't send the request on to the server...

I've not yet got to the bottom of the 'mis-typed' URLs. I've grep-ed out some of the 'file type' patterns, but looking at the results I see the matches are in the referrer, not the target URL, so I need to do more to find the lines analog is seeing as a specific file type.

Thanks for your help.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 21 February 2009 17:17
To: Support for analog web log analyzer
Subject: Re: [analog-help] Problem with page counts
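The open question of whether the truncated lines all end at one specific length can be checked with a length histogram. A minimal sketch, assuming the corrupt lines have already been saved to a file of their own; the function names and the 50% threshold are illustrative choices, not anything from analog.

```python
from collections import Counter


def length_histogram(lines):
    """Map line length (excluding the line ending) to number of lines."""
    return Counter(len(line.rstrip("\r\n")) for line in lines)


def likely_truncation_length(lines, min_share=0.5):
    """Return the dominant line length if it covers at least min_share
    of all lines, otherwise None."""
    hist = length_histogram(lines)
    if not hist:
        return None
    length, count = hist.most_common(1)[0]
    total = sum(hist.values())
    return length if count / total >= min_share else None
```

If most corrupt lines share one length, a fixed buffer or field limit in the server's logging is a plausible suspect; a wide spread of lengths would point elsewhere. Passing an open file object as `lines` works, because the functions only iterate over it.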
Re: [analog-help] Problem with page counts
On 2/21/2009 1:27 PM, Iain Hunneybell wrote:
> As for page counts, my best rationalisation is that the high overnight
> count is spiders and so analog is correctly showing page requests that
> aren't being recorded by the page view 'bug'. Then during the day,
> proxies are causing the page bug to record higher page views than
> seen by the servers. It's the best rationalisation I've come up with
> so far!

You should be able to test that by using FROM and TO and doing a log analysis for an hour in the middle of the night, and looking at the Full Browser report. Most well behaved spiders identify themselves.

You can also do a Full Browser report on requests for /robots.txt and then use that to create a list of BROWEXCLUDE commands so that you can see if the human-driven traffic patterns make more sense.

Aengus
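The /robots.txt idea can also be tried directly on the raw logs. The sketch below is not analog's Full Browser report; it is a rough stand-in that assumes NCSA combined log format, and the sample lines and the bot name are invented for illustration. Turning the resulting agent list into BROWEXCLUDE commands is left to the analog documentation.

```python
import re
from collections import Counter

# Assumes NCSA "combined" log format; adjust if the real logs differ.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count user agents that requested /robots.txt (query string ignored)."""
    agents = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("path").split("?", 1)[0] == "/robots.txt":
            agents[m.group("agent")] += 1
    return agents


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '198.51.100.9 - - [21/Feb/2009:03:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '198.51.100.9 - - [21/Feb/2009:03:00:02 +0000] "GET /bdotg/action/home HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '8.8.8.8 - - [21/Feb/2009:03:00:05 +0000] "GET /index.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1)"',
]

for agent, count in robots_txt_agents(SAMPLE_LINES).most_common():
    print(count, agent)
```

Agents that fetch /robots.txt are very likely spiders, so their overnight request volume can then be compared against the gap between analog and the page tag.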
RE: [analog-help] Problem with page counts
That's a very good idea :-)

Many thanks Aengus

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Aengus
Sent: 21 February 2009 18:45
To: Support for analog web log analyzer
Subject: Re: [analog-help] Problem with page counts
Re: [analog-help] Problem with page counts
2009/2/20 Iain Hunneybell i...@ipmarketing.co.uk:
> I am trying to analyse pages from a large 'portal' site and am having
> real problems with page counts, and all attempts with PAGEINCLUDE, TYPE
> and FILEALIAS and other experiments fail.
>
> The site generates URLs similar to:
>
>     /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
>
> It seems to be the period in the input vars that's causing the problem,
> as the File Type report then lists things like:
>
>     reqs   %reqs  Gbytes  %bytes  extension
>     7277   0.08%    0.18   0.32%  .s=tl
>    12683   0.15%    0.11   0.20%  .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment
>     4485   0.05%    0.11   0.20%  .s=m
>
> Note the very low percentages, as this is in effect counting page by
> page as a different file type.

I'm not seeing this. I just tried this experiment and I see this file listed as [no extension], which is correct. What do they look like in your raw logfiles? For example, is the question mark encoded as %3F, which would be a literal question mark instead of an argument separator?

> So I've tried things like:
>
>     PAGEINCLUDE *.s*
>     PAGEINCLUDE *.t*
>
> (with and without the trailing *). I've also tried patterns like:
>
>     PAGEINCLUDE /home
>
> But all attempts fail.

    PAGEINCLUDE /bdotg/action/home

works for me. But if my hypothesis above is correct, you might need

    PAGEINCLUDE /bdotg/action/home*

The PAGEINCLUDE has nothing to do with the file types by the way (although it's typically used that way). You can make any single file into a page.

-- 
Stephen Turner
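The %3F hypothesis can be illustrated with a simplified model of "take the extension from the path part of the request". This is not analog's code, only a sketch using Python's standard URL handling, and the example URL assumes `&` separators between the parameters.

```python
import posixpath
from urllib.parse import urlsplit


def extension_of(request_target):
    """Extension of the path part of a request target ('' if none)."""
    path = urlsplit(request_target).path
    return posixpath.splitext(path)[1]


def has_encoded_question_mark(request_target):
    """True if the target contains a percent-encoded '?' (%3F or %3f)."""
    return "%3f" in request_target.lower()


plain = "/bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m"
encoded = "/bdotg/action/home%3Fr.l1=1078549133&r.lc=en&r.s=m"

# With a real '?', the arguments are split off and no extension remains.
print(repr(extension_of(plain)))     # ''
# With '%3F', the arguments stay in the path, so the text after the
# last period is taken as the extension.
print(repr(extension_of(encoded)))   # '.s=m'
```

Under this model an encoded question mark reproduces exactly the kind of '.s=m' entry seen in the File Type report, which is why checking the raw log lines for %3F is a reasonable first step.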
RE: [analog-help] Problem with page counts
Thanks for the quick reply :-)

Sadly I have no UNIX host to hand and these are Gig files, so I can't head/tail/grep easily. Windows grep dies... I'll write something to parse the files so I can have a real look at the records...

I've some other 'funnies', like 1.98% Netscape/4 browser usage (according to the summary), but if I run something like the top 2000 browser sigs in the full browser report I can't find a single reference to Netscape/4 (of course, if I could simply grep the files... :-( ).

I'll try the PAGEINCLUDE you suggest, but something would seem to be going wrong, judging by the way parts of the query string are showing up in the File Type report. Yes, I get [no extension] pages, but from a page tracking service I'm expecting around 460K pages and I'm 'only' seeing 395K. But it's the long tail of .s=m and similar files which suggests some counting is going astray. My thought was that I could 'mop these up' by defining each 'mis-read' filetype as a page, but my various attempts have failed.

I'm running 6.0/Win32, if that's an issue?

Thanks again.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 19:47
To: analog-help
Subject: Re: [analog-help] Problem with page counts
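For the "write something to parse the files" step, a streaming filter is usually enough to stand in for grep on multi-gigabyte logs, because it never holds more than one line in memory. A minimal sketch; the function name is made up, and latin-1 is chosen only so that odd bytes in the logs cannot raise decoding errors.

```python
def matching_lines(paths, needle):
    """Yield every line containing needle, reading each file line by line."""
    for path in paths:
        # latin-1 maps every byte to a character, so decoding never fails.
        with open(path, "r", encoding="latin-1") as handle:
            for line in handle:
                if needle in line:
                    yield line
```

The generator can be written straight to an output file, one line at a time, which keeps memory use flat however large the input is.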
Re: [analog-help] Problem with page counts
2009/2/20 Iain Hunneybell i...@ipmarketing.co.uk:
> Sadly I have no UNIX host to hand and these are Gig files and so I
> can't head/tail/grep easily. Windows grep dies... I'll write something
> to parse the files so I can have a real look at the records...

Can you install Cygwin?

-- 
Stephen Turner
RE: [analog-help] Problem with page counts
I can certainly try. Actually, this long tail only accounts for around 1%, so maybe I shouldn't waste time on it now and should come back to it later. I'll share anything I find as to why some records are spilling out like this.

Thanks for your help.../Iain

-----Original Message-----
From: analog-help-boun...@lists.meer.net [mailto:analog-help-boun...@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 20:53
To: Support for analog web log analyzer
Subject: Re: [analog-help] Problem with page counts