On Friday, February 11, 2005 7:50 PM [GMT],
Ken Schweigert <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 11, 2005 at 07:21:00PM -0000, Aengus wrote:
>> On Friday, February 11, 2005 7:14 PM [GMT],
>> Ken Schweigert <[EMAIL PROTECTED]> wrote:
>>
>>> Or ... to regenerate it at your convenience:
>>>
>>>
>>> [EMAIL PROTECTED] tmp]$ wget http://www.robotstxt.org/wc/active/all.txt
>>> [EMAIL PROTECTED] tmp]$ grep "robot-name:" all.txt | awk -F: '{print $2}' |
>>> sed 's/^ *//g' | sort | awk '{print "ROBOTINCLUDE \"" $1 "*\""}'
>>
>> grep "robot-name:" or grep "robot-useragent:"?
>
> I used robot-name because there were entries for robot-useragent that
> had stuff like:
>
> robot-useragent: Due to a deficiency in Java it's not currently possible to set the User-Agent.
> robot-useragent:None
> robot-useragent: no
> robot-useragent:
>
> This kind of messed up the list and using robot-name produces a list
> more like Jeremy's.  Maybe he can chime in and let us know the correct
> way.

Analog is looking at the UserAgent field in the log file. If the robot
doesn't set the User-Agent string (whether because of a deficiency in
Java or some other reason), then Analog can't mark it out as a Robot, so
ROBOTINCLUDE "Acme.Spider*" won't achieve anything.

If the list is correct, and "Walhello appie" actually sets a UserAgent
of "appie", then ROBOTINCLUDE "Walhello appie*" won't work either.

So assuming the information in the list is correct, robot-useragent:
would be the correct field to use. If the list isn't 100% reliable, then
I'd still be inclined to go with the UserAgent field, but probably crop
any version information that might be included. In many cases you come
up with the same result, but in the cases where you get different
results, the UserAgent is more likely to be what you're looking for.

For example, ROBOTINCLUDE "wired-digital-newsbot/*" seems much more
likely to do the job than ROBOTINCLUDE "Wired Digital*" (just to pick
an example at random).
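
Something along these lines might do it. This is only a rough sketch: it
assumes all.txt keeps each "robot-useragent:" entry on a line of its own,
the function name make_robotincludes is made up for illustration, and
the filtering rules (drop empty, "no" and "none" values, drop free-text
sentences, crop everything from the first "/") are my guesses at what
"messed up the list", not anything Analog prescribes.

```shell
# Sketch: build ROBOTINCLUDE lines from the robot-useragent field.
# make_robotincludes is a hypothetical helper name, not part of Analog.
make_robotincludes() {
  grep -i '^robot-useragent:' "$1" |
    # Strip the field name and surrounding whitespace.
    sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//' |
    awk '{
      ua = $0
      sub("/.*", "", ua)              # crop "/version" and anything after it
      lc = tolower(ua)
      if (ua == "" || lc == "no" || lc == "none") next   # placeholder values
      if (ua ~ /[ \t]/) next          # free-text remarks, not a real agent
      if (lc == "mozilla") next       # would match ordinary browsers too
      print "ROBOTINCLUDE \"" ua "*\""
    }' |
    LC_ALL=C sort -u
}
```

Run it as "make_robotincludes all.txt" and read through the output before
pasting it into the configuration; a pattern that is too short will mark
real visitors as robots.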

Aengus

