In <[EMAIL PROTECTED]>, "Sean M. Burke"
<[EMAIL PROTECTED]> writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?
Some thoughts:
* Implement specifications fully, or at least recognize when your
implementation reaches something it doesn't support
Examples:
Some spiders cannot handle protocol-preserving links like
<a href="//www.foo.com/">something</a>
which is a perfectly valid link that should preserve the current protocol, and
instead access http://currentbase//www.foo.com/
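Python's standard URL resolver already gets this right, so a spider built on it should too. A quick sketch (the base URL is a hypothetical example of mine):

```python
from urllib.parse import urljoin

# The page the link was found on (hypothetical base URL).
base = "https://www.example.com/dir/page.html"

# A protocol-preserving link inherits the scheme of the base document,
# so an https page yields an https target.
resolved = urljoin(base, "//www.foo.com/")
print(resolved)  # https://www.foo.com/
```

Hand-rolled string concatenation is where the http://currentbase//www.foo.com/ mistake comes from; RFC 3986 resolution avoids it.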
* Identify yourself, set appropriate headers
Spiders should include a unique name and version number (for robots.txt), and
contact information for the author (a _working_ web site or email address) in
the user agent string.
Sending valid Referer headers also helps server operators understand what a
robot is doing; sending the author's homepage as the referrer usually does not.
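In code, that amounts to putting the robot's name, version, and contact URL into the User-Agent, and the actual referring page into Referer. A minimal sketch with the standard library (the spider name and URLs are made up for illustration):

```python
import urllib.request

# Hypothetical spider name, version, and contact URL -- substitute your own.
USER_AGENT = "ExampleSpider/1.0 (+https://www.example.com/spider.html)"

def build_request(url, referer=None):
    """Build a request that identifies the robot and, when known,
    carries the actual referring page as the Referer."""
    headers = {"User-Agent": USER_AGENT}
    if referer is not None:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)

req = build_request("http://www.foo.com/page.html",
                    referer="http://www.foo.com/")
```

The token before the slash ("ExampleSpider") is also what a robots.txt User-agent line would match against.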
* Don't make assumptions about the meaning of URLs
Example:
http://www.foo.com/something and http://www.foo.com/something/ are not
necessarily the same, nor is the former required to redirect to the latter.
http://www.foo.com/ can return different things depending on parameters of the
request, or other conditions (time of day, temperature, mood of the server) --
depending on the application, the spider should take variants of the same URL
into account.
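Concretely, a spider should keep such variants as distinct visit/cache keys rather than normalizing them away. Even the standard parser confirms the paths really differ:

```python
from urllib.parse import urlsplit

# The two URLs name potentially different resources; do not collapse
# them into one key by guessing.
without_slash = urlsplit("http://www.foo.com/something")
with_slash = urlsplit("http://www.foo.com/something/")

print(without_slash.path)  # /something
print(with_slash.path)     # /something/
```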
* Cache server responses when cacheable
At least cache locally during a run (I dislike spiders requesting 2000 copies of
clear.gif), but preferably between runs, too (HTTP/1.1 cache control, Expires,
ETag)
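A between-runs cache built on ETag revalidation can be sketched like this; http_get here is a stand-in for whatever HTTP layer the spider actually uses, not a real library call:

```python
# Minimal cache sketch using ETag / If-None-Match revalidation.
class RunCache:
    def __init__(self):
        self.store = {}  # url -> (etag, body)

    def fetch(self, url, http_get):
        etag, body = self.store.get(url, (None, None))
        headers = {"If-None-Match": etag} if etag else {}
        status, new_etag, new_body = http_get(url, headers)
        if status == 304:        # not modified: reuse the cached body
            return body
        self.store[url] = (new_etag, new_body)
        return new_body

# Stub transport for illustration: answers 304 once the ETag matches.
def stub_get(url, headers):
    if headers.get("If-None-Match") == '"v1"':
        return 304, '"v1"', None
    return 200, '"v1"', b"<html>hello</html>"

cache = RunCache()
first = cache.fetch("http://www.foo.com/", stub_get)
second = cache.fetch("http://www.foo.com/", stub_get)  # served via 304
```

The second fetch costs the server only a 304 response instead of the full body, which is exactly what the Expires/ETag machinery is for.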
* Recognize loops (MD5 signatures are your friend, but recognize loops even
when the content changes slightly)
Example:
Appending /something or ?something to a URL often makes no difference to what a
web server returns; all it takes is a relative link on such a page to construct
an infinite URL chain, like
http://www.foo.com/page.html/
http://www.foo.com/page.html/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/otherpage/
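The MD5-signature idea from above can be sketched in a few lines; note that a plain digest only catches byte-identical pages, so slightly changing content (timestamps, counters) needs a fuzzier fingerprint:

```python
import hashlib

seen_digests = set()

def is_loop(body):
    """Return True when byte-identical content was already fetched.
    Near-identical pages need a fuzzier fingerprint than plain MD5."""
    digest = hashlib.md5(body).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False

first_seen = is_loop(b"<html>otherpage</html>")  # False: new content
looped = is_loop(b"<html>otherpage</html>")      # True: same bytes again
```

Combined with a cap on URL depth, this stops the /otherpage/otherpage/... chain after one round trip instead of thousands.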
* Expect and handle errors (expect the unexpected :-))
Badly coded content and links are common; expect that content passed to the
spider will not be perfect.
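In practice this means every fetch sits inside an error handler, so one garbage link scraped from a bad page cannot kill the whole crawl. A minimal sketch with the standard library:

```python
import urllib.error
import urllib.request

def safe_fetch(url, timeout=10):
    """Fetch a URL but return None instead of crashing on malformed
    URLs, unreachable hosts, or broken responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError, OSError):
        return None

# A malformed link from a badly coded page must not stop the spider.
result = safe_fetch("not a valid url")
```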
* Beware of suspicious links
Check URLs carefully before following a link, and check for fully qualified
hostnames etc. Of course spiders are always run off perfectly managed and
secured machines -- not.
Example:
http://localhost/cgi-bin/phf?...
http://localhost/default.ida?...
http://proxy/
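A link filter covering those cases might look like this; the deny lists are hypothetical starting points, to be extended for your own environment:

```python
from urllib.parse import urlsplit

# Hypothetical deny rules covering the examples above; extend as needed.
BLOCKED_HOSTS = {"localhost", "proxy"}
BLOCKED_PATH_HINTS = ("/cgi-bin/phf", "/default.ida")

def looks_suspicious(url):
    """Reject links to unqualified hostnames or known exploit paths."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in BLOCKED_HOSTS or "." not in host:
        return True  # not a fully qualified hostname
    return any(hint in parts.path for hint in BLOCKED_PATH_HINTS)
```

Running the filter before the fetch keeps the spider from being tricked into probing internal hosts on the machine it happens to run on.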
--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message
to "[EMAIL PROTECTED]".