On Friday, February 11, 2005 7:50 PM [GMT], Ken Schweigert <[EMAIL PROTECTED]> wrote:
> On Fri, Feb 11, 2005 at 07:21:00PM -0000, Aengus wrote:
>> On Friday, February 11, 2005 7:14 PM [GMT],
>> Ken Schweigert <[EMAIL PROTECTED]> wrote:
>>
>>> Or ... to regenerate it at your convenience:
>>>
>>>
>>> [EMAIL PROTECTED] tmp]$ wget http://www.robotstxt.org/wc/active/all.txt
>>> [EMAIL PROTECTED] tmp]$ grep "robot-name:" all.txt | awk -F: '{print $2}' |
>>> sed 's/^ *//g' | sort | awk '{print "ROBOTINCLUDE \"" $1 "*\""}'
>>
>> grep "robot-name:" or grep "robot-useragent:"?
>
> I used robot-name because there were entries for robot-useragent that
> had stuff like:
>
> robot-useragent: Due to a deficiency in Java it's not
> currently possible to set the User-Agent.
> robot-useragent:None
> robot-useragent: no
> robot-useragent:
>
> This kind of messed up the list and using robot-name produces a list
> more like Jeremy's. Maybe he can chime in and let us know the correct
> way.

Analog looks at the UserAgent field in the log file. If the robot
doesn't set the User-Agent string (whether because of a deficiency in
Java or for some other reason), then Analog can't mark it out as a
robot, so ROBOTINCLUDE "Acme.Spider*" won't achieve anything. If the
list is correct, and "Walhello appie" actually sets a UserAgent of
"appie", then ROBOTINCLUDE "Walhello appie*" won't work either.

So assuming the information in the list is correct, robot-useragent:
would be the correct field to use. If the list isn't 100% reliable, then
I'd still be inclined to go with the UserAgent field, but probably crop
any version information that might be included.

In many cases you come up with the same result, but in the cases where
you get different results, the UserAgent is more likely to be what
you're looking for. For example, ROBOTINCLUDE "wired-digital-newsbot/*"
seems much more likely to do the job than ROBOTINCLUDE "Wired Digital*"
(just to pick an example at random).

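As a rough sketch of that approach (not something tested against the live all.txt), the following Python reads the robot-useragent: fields, drops the unusable entries Ken listed, crops version information after the first "/", and prints ROBOTINCLUDE lines. The function name useragent_patterns, the PLACEHOLDERS set, and the two skip heuristics (treating a multi-word value with no "/" as a free-text note, and skipping "Mozilla/..." strings) are my own assumptions, so the output would still want a look by eye before going into the Analog config.

```python
import re

# Values that mean "no User-Agent is sent" (assumed list, extend as needed).
PLACEHOLDERS = {"", "no", "none", "n/a", "unknown"}


def useragent_patterns(lines):
    """Return sorted ROBOTINCLUDE lines built from robot-useragent fields."""
    patterns = set()
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "robot-useragent":
            continue  # other fields, and wrapped continuation lines
        value = value.strip()
        if value.lower() in PLACEHOLDERS:
            continue
        # Assumed heuristic: several words and no "/" is a free-text note.
        if "/" not in value and len(value.split()) > 3:
            continue
        # "Mozilla/*" would also match ordinary browsers; handle those by hand.
        if value.lower().startswith("mozilla/"):
            continue
        # Crop version information: "name/1.5 ..." becomes "name/".
        match = re.match(r"([^/]+/)v?\d", value)
        prefix = match.group(1) if match else value
        patterns.add('ROBOTINCLUDE "%s*"' % prefix)
    return sorted(patterns)
```

To use it on the downloaded file, something like `print("\n".join(useragent_patterns(open("all.txt"))))` should do. Entries that send no User-Agent at all are simply left out, since Analog has nothing to match them against.
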
Aengus