In <[EMAIL PROTECTED]>, "Sean M. Burke" <[EMAIL PROTECTED]> writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?
Some thoughts:

* Implement specifications fully, or at least recognize when your
  implementation reaches something it doesn't support.

  Example: Some spiders cannot handle protocol-relative links like
  <a href="//www.foo.com/">something</a>, which is a perfectly valid
  link that should preserve the current protocol; instead they access
  http://currentbase//www.foo.com/

* Identify yourself, set appropriate headers.

  Spiders should include a unique name and version number (for
  robots.txt), and contact information for the author (a _working_
  web site or email address) in the user agent string. Sending valid
  Referer headers is helpful for understanding what a robot is doing,
  too; sending the author's homepage as the referrer usually is not.

* Don't make assumptions about the meaning of URLs.

  Example: http://www.foo.com/something and http://www.foo.com/something/
  are not necessarily the same, nor is the former required to redirect
  to the latter. http://www.foo.com/ can return different things
  depending on parameters of the request, or other conditions (time of
  day, temperature, mood of the server) -- depending on the
  application, the spider should take variants of the same URL into
  account.
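To illustrate the protocol-relative link point above: a browser resolves
such links relative to the current page's scheme, and a spider should do
the same. A minimal Python sketch (Python, the function name, and the
hostnames are my own additions, not from the original post; Python's
standard urljoin already implements the relevant RFC rules):

```python
from urllib.parse import urljoin

def resolve_link(base_url, href):
    """Resolve a link the way a browser would, including
    protocol-relative links such as //www.foo.com/ -- the
    scheme is taken from the page the link appears on."""
    return urljoin(base_url, href)

# The scheme of the current page is preserved:
print(resolve_link("http://current.example/page.html", "//www.foo.com/"))
# -> http://www.foo.com/
print(resolve_link("https://current.example/page.html", "//www.foo.com/"))
# -> https://www.foo.com/
```

A spider that instead naively concatenates the base and the href ends up
requesting http://currentbase//www.foo.com/, the broken behavior
described above.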
* Cache server responses when cacheable.

  At least locally during a run (I dislike spiders requesting 2000
  copies of clear.gif), but preferably between runs, too (HTTP/1.1
  cache control, Expires, ETag).

* Recognize loops (MD5 signatures are your friend, but recognize
  loops even when the content changes slightly).

  Example: Appending /something or ?something to a URL often does not
  make any difference to what a web server returns; all it takes is a
  relative link on that page to construct an infinite URL chain, like

  http://www.foo.com/page.html/
  http://www.foo.com/page.html/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/otherpage/
  http://www.foo.com/page.html/otherpage/otherpage/otherpage/otherpage/

* Expect and handle errors (expect the unexpected :-)).

  Badly coded content and links are common; expect that code passed to
  the spider will not be perfect.

* Beware of suspicious links.

  Check URLs carefully before following a link, check for fully
  qualified hostnames, etc. Of course spiders are always run off
  perfectly managed and secured machines -- not.

  Example:
  http://localhost/cgi-bin/phf?...
  http://localhost/default.ida?...
  http://proxy/

-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

-- 
This message was sent by the Internet robots and spiders discussion
list ([EMAIL PROTECTED]). For list server commands, send "help" in the
body of a message to "[EMAIL PROTECTED]".
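[The loop-recognition advice above can be sketched in a few lines of
Python; the class name and sample data are my own, not from the
original post. Note that an exact content hash like this catches the
/otherpage/otherpage/... chains where every URL returns identical
bytes, but not loops where the content changes slightly -- those need
fuzzier comparison, as the post warns.]

```python
import hashlib

class LoopDetector:
    """Remember a digest of every fetched page so the spider can
    skip URLs whose content it has already seen byte-for-byte."""

    def __init__(self):
        self.seen_digests = set()

    def is_duplicate(self, body):
        """Return True if this exact body was fetched before;
        otherwise record its digest and return False."""
        digest = hashlib.md5(body).hexdigest()
        if digest in self.seen_digests:
            return True
        self.seen_digests.add(digest)
        return False

detector = LoopDetector()
page = b"<html>same content at every depth</html>"
print(detector.is_duplicate(page))  # first fetch: False
print(detector.is_duplicate(page))  # identical body again: True
```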