Hi Sean,
You might want to consider exploring the not-yet-approved updated
robots.txt standard, which covers Allow rules, and explaining how to apply
them in your spider. That could help raise awareness of the robots.txt
standard. You could also discuss how to use robots.txt with your spider
and the issues surrounding it.
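As a minimal sketch of how Allow rules carve exceptions out of Disallow rules (in Python rather than Perl, purely for illustration; the robots.txt content and the "MySpider/1.0" agent name are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt using the draft Allow rule: everything under
# /private/ is off-limits except for one readme page. Note that Python's
# parser applies the first matching rule, so the Allow line comes first.
robots_txt = """\
User-agent: *
Allow: /private/readme.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The Allow line permits the readme; the Disallow still blocks the rest.
readme_ok = parser.can_fetch("MySpider/1.0",
                             "http://example.com/private/readme.html")
secret_ok = parser.can_fetch("MySpider/1.0",
                             "http://example.com/private/secret.html")
```

Here `readme_ok` is true and `secret_ok` is false: the more specific Allow rule makes an exception to the broader Disallow.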
You might want to talk about how to identify your spider to the web server.
There is really no standard out there, but the convention that seems to be
used the most also makes the most sense. For example:
UserAgent/1.0 (www.myspider.com/infopage.html; [EMAIL PROTECTED]; other info)
This type of user agent string can be easily parsed by other software that
looks at the User-Agent information. It's simple: the user agent name (the
spider), followed by a separator and a version number, then a space, an
open parenthesis, a semicolon-separated list of information, and a closing
parenthesis. Since no standard is in place for something like this, it
might be nice to include one in your book. You might even stipulate that
the first piece of information in the list be the web site URL, followed by
an email address.
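The parsing really is that simple; here is a sketch in Python (the spider name, URL, and email address are hypothetical placeholders following the convention above):

```python
import re

# Hypothetical identification string following the convention described
# above: name/version, then a parenthesized, semicolon-separated info list
# whose first item is the spider's info URL and second its contact email.
ua = "MySpider/1.0 (http://www.myspider.com/infopage.html; bot@myspider.com; other info)"

# name, "/", version, whitespace, then everything inside the parentheses.
m = re.match(r"(?P<name>[^/\s]+)/(?P<version>[^\s(]+)\s+\((?P<info>[^)]*)\)", ua)
name, version = m.group("name"), m.group("version")
info = [field.strip() for field in m.group("info").split(";")]
site_url, email = info[0], info[1]  # URL first, email second, per the convention
```

A server-log analyzer can use the same few lines to pull out a spider's contact page, which is exactly why a fixed ordering of the fields is worth stipulating.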
Kind Regards,
Mike
Michael D. Lange
Website Management Tools, Inc.
Maximize Your Online Visibility
www.websitemanagementtools.com
(US) 678-714-0279
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
Behalf Of Sean M. Burke
Sent: Thursday, March 07, 2002 4:51 AM
To: [EMAIL PROTECTED]
Subject: [Robots] Perl and LWP robots
Hi all!
My name is Sean Burke, and I'm writing a book for O'Reilly, which is
basically to replace Clinton Wong's now out-of-print /Web Client
Programming with Perl/. In my book draft so far, I haven't discussed
actual recursive spiders (I've only discussed getting a given page, and
then every page it links to that is also on the same host), since I
think that most readers who think they want a recursive spider really
don't.
But it has been suggested that I cover recursive spiders, just for sake of
completeness.
Aside from basic concepts (don't hammer the server; always obey the
robots.txt; don't span hosts unless you are really sure that you want to),
are there any particular bits of wisdom that list members would want me to
pass on to my readers?
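The three rules of thumb above can be sketched as a crawl loop (a minimal illustration in Python rather than Perl; `fetch`, `extract_links`, and the "MySpider/1.0" agent name are hypothetical placeholders supplied by the caller):

```python
import time
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

def same_host(seed_url, candidate_url):
    """True if candidate_url lives on the same host as the seed."""
    return urlparse(seed_url).netloc == urlparse(candidate_url).netloc

def polite_crawl(seed_url, fetch, extract_links, delay=2.0, limit=100):
    """Breadth-first crawl restricted to the seed's host.

    `fetch(url)` returns a page body and `extract_links(html, base_url)`
    returns the links on it; both are caller-supplied. `delay` spaces out
    requests so the server is not hammered.
    """
    robots = RobotFileParser(urljoin(seed_url, "/robots.txt"))
    robots.read()
    seen, queue = {seed_url}, [seed_url]
    while queue and len(seen) <= limit:
        url = queue.pop(0)
        if not robots.can_fetch("MySpider/1.0", url):
            continue                      # always obey robots.txt
        html = fetch(url)
        time.sleep(delay)                 # don't hammer the server
        for link in extract_links(html, url):
            link = urljoin(url, link)
            if same_host(seed_url, link) and link not in seen:
                seen.add(link)            # don't span hosts
                queue.append(link)
    return seen
```

The `seen` set doubles as the loop's termination guarantee: a recursive spider without one will happily chase link cycles forever.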
--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/
--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]). For list server commands, send "help" in the body of
a message to "[EMAIL PROTECTED]".