Will wrote:
> Greetings everyone.  This is my first post so I apologize in advance
> if I'm doing something wrong.
>
> I have inherited stats duties for our company which has about 20
> domains and lots and lots of IIS logs.  We are using Urchin and cannot
> switch at this time.  I want to use Analog to pre-parse my IIS log
> files on a daily basis by removing all log entries made by spiders (as
> identified by some external machine-generated spiders.cfg file).
>
> Urchin has very crappy and limited functionality for filtering
> spiders.  It is clearly not doing a good job identifying crawlers so I
> figured this was my best bet, to pre-parse using analog before Urchin
> gets its grubby hands on my log files.
>
> Can anyone help me with a .cfg file and command line syntax for
> accomplising this?  I dont want it to do any reporting or analyzing,
> just output the identical IIS log but with all spider/bot entries
> removed.

Analog won't modify your logfiles. It only reads them and reports on their
contents. If you want to physically remove robot/spider entries from your
logs, you can use something as simple as the FINDSTR command included in
Windows, along with a list of strings that identify spiders. You can create
that list from information on http://www.robotstxt.org/ or you could create
a custom list by using Analog to analyse your logs for behaviour that you
identify as spider-like. (For example, you could run a Full Browser report
to get a list of browser names that are obviously spiders.)

You would use FINDSTR like this to create a "no spider" version of your
logfile:

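FINDSTR /V /I /G:spiders.txt ex050523.log > ns050523.log

Here /V keeps only the lines that do not match, /I makes the match
case-insensitive, and /G: reads the search strings from spiders.txt.

spiders.txt would contain a list of strings that match known spiders in your
logfile. These might be agent strings or host addresses. For example, it
might contain the following lines: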

googlebot
msnbot
slurp
10.123.45.67

(where 10.123.45.67 is the IP address of a spider, for example).
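
If FINDSTR is not available, the same filtering can be sketched in a few
lines of Python. This is only an illustration of the idea, not something
Analog or Urchin provides; the function names (load_patterns,
drop_matching, filter_file) are made up for this example, and it assumes
one search string per line in spiders.txt, matched case-insensitively
anywhere in the log line.

```python
def load_patterns(lines):
    """Return lower-cased, non-empty search strings, one per input line."""
    return [line.strip().lower() for line in lines if line.strip()]


def drop_matching(log_lines, patterns):
    """Yield only the log lines that contain none of the patterns."""
    for line in log_lines:
        lowered = line.lower()
        if not any(p in lowered for p in patterns):
            yield line


def filter_file(log_path, patterns_path, out_path):
    """Copy log_path to out_path, skipping lines that match any pattern."""
    with open(patterns_path, encoding="latin-1") as f:
        patterns = load_patterns(f)
    # newline="" keeps the original line endings of the log untouched.
    with open(log_path, encoding="latin-1", newline="") as src, \
         open(out_path, "w", encoding="latin-1", newline="") as dst:
        dst.writelines(drop_matching(src, patterns))
```

Calling filter_file("ex050523.log", "spiders.txt", "ns050523.log") would
then write the "no spider" copy of the log.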

Note that this approach can have unexpected consequences, because FINDSTR
matches the strings anywhere in the line. If you have a lot of referrals
from a page called slurpy.htm, for example, those entries would also be
excluded by the "slurp" entry (the Inktomi spider) in the list above.
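
One way to reduce such false matches is to compare the strings against the
agent and client address fields only. The Python sketch below illustrates
that idea under some assumptions: the logs are in the W3C extended format
with a "#Fields:" header line, the cs(User-Agent) and c-ip fields are being
logged, and the function name filter_by_fields is invented for this
example. If a log has no "#Fields:" line, nothing is filtered.

```python
def filter_by_fields(log_lines, patterns, fields=("cs(User-Agent)", "c-ip")):
    """Yield log lines, dropping entries whose chosen fields match a pattern."""
    patterns = [p.lower() for p in patterns]
    columns = []  # positions of the fields to check, reset by each #Fields line
    for line in log_lines:
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                names = line.split()[1:]
                columns = [names.index(f) for f in fields if f in names]
            yield line  # keep header/directive lines unchanged
            continue
        values = line.split()
        checked = " ".join(values[i] for i in columns if i < len(values)).lower()
        if any(p in checked for p in patterns):
            continue  # spider entry: skip it
        yield line
```

With this version a request for slurpy.htm from an ordinary browser is
kept, because "slurp" is only looked for in the agent and address fields.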

Aengus
