In <[EMAIL PROTECTED]>, "Sean M. Burke" 
<[EMAIL PROTECTED]> writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?

Some thoughts:


* Implement specifications fully, or at least recognize when your 
implementation reaches something it doesn't support

Examples:

Some spiders cannot handle protocol-preserving links like
        <a href="//www.foo.com/">something</a>
which is a perfectly valid link that should preserve the current protocol, and 
instead access http://currentbase//www.foo.com/
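
A library that implements proper URL resolution gets this right; a minimal
sketch in Python (the URLs are placeholders):

    from urllib.parse import urljoin

    base = "http://www.example.com/dir/page.html"
    # Correct resolution keeps the scheme of the base URL:
    print(urljoin(base, "//www.foo.com/"))    # http://www.foo.com/
    # Naive string pasting produces the broken form described above:
    print(base.rsplit("/", 1)[0] + "//www.foo.com/")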


* Identify yourself, set appropriate headers

Spiders should include a unique name and version number (for robots.txt), and 
contact information for the author (a _working_ web site or email address) in 
the user agent string.

Sending valid Referer headers also helps server operators understand what a 
robot is doing; sending the author's homepage as the referrer usually does not.
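
A minimal sketch of both headers using Python's urllib (the spider name,
version, and contact URL are made-up placeholders):

    import urllib.request

    req = urllib.request.Request(
        "http://www.foo.com/page.html",
        headers={
            # Unique name and version, plus a working contact address:
            "User-Agent": "ExampleSpider/1.0 (+http://www.example.com/spider.html)",
            # The page on which this link was actually found:
            "Referer": "http://www.foo.com/index.html",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()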


* Don't make assumptions on the meaning of URLs

Example:

http://www.foo.com/something and http://www.foo.com/something/ are not 
necessarily the same, nor is the former required to redirect to the latter.

http://www.foo.com/ can return different things depending on parameters of the 
request, or other conditions (time of day, temperature, mood of the server) -- 
depending on the application, the spider should take variants of the same URL 
into account.
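
One conservative approach is to key the crawl on the exact URL and never
assume two variants are equivalent unless the server says so with a redirect;
a rough sketch (crawl_key is a name I made up):

    from urllib.parse import urlsplit

    def crawl_key(url):
        # Deliberately conservative: keep trailing slashes, query strings
        # and path case -- only the scheme and host are case-insensitive
        # per the URL specification.
        parts = urlsplit(url)
        return (parts.scheme.lower(), parts.netloc.lower(),
                parts.path, parts.query)

    assert crawl_key("http://www.foo.com/something") != \
           crawl_key("http://www.foo.com/something/")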


* Cache server responses when cacheable

At least locally during a run (I dislike spiders requesting 2000 copies of
clear.gif) but preferably between runs, too (HTTP/1.1 Cache-Control, Expires, 
ETag)
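
A rough sketch of conditional requests between runs, again with Python's
urllib (a real spider would persist the cache to disk; ExampleSpider is a
placeholder):

    import urllib.request, urllib.error

    cache = {}  # url -> (etag, body)

    def fetch(url):
        headers = {"User-Agent": "ExampleSpider/1.0"}
        if url in cache:
            headers["If-None-Match"] = cache[url][0]
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                body = resp.read()
                etag = resp.headers.get("ETag")
                if etag:
                    cache[url] = (etag, body)
                return body
        except urllib.error.HTTPError as e:
            if e.code == 304:          # Not Modified -- reuse cached copy
                return cache[url][1]
            raise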


* Recognize loops (MD5 signatures are your friend, but recognize loops even 
when the content changes slightly)

Example:

Appending /something or ?something to a URL often does not make any difference
to what a web server returns; all it takes is a relative link on that page to 
construct an infinite URL chain, like

http://www.foo.com/page.html/
http://www.foo.com/page.html/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/
http://www.foo.com/page.html/otherpage/otherpage/otherpage/otherpage/
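
A rough sketch of both defenses -- a content signature check and a cap on
path depth (the threshold is arbitrary):

    import hashlib
    from urllib.parse import urlsplit

    seen_signatures = set()
    MAX_PATH_DEPTH = 16   # arbitrary cut-off against infinite URL chains

    def should_follow(url):
        return urlsplit(url).path.count("/") <= MAX_PATH_DEPTH

    def is_duplicate(body):
        # Exact-duplicate detection only; content that changes slightly
        # (e.g. an embedded timestamp) needs a fuzzier signature.
        sig = hashlib.md5(body).hexdigest()
        if sig in seen_signatures:
            return True
        seen_signatures.add(sig)
        return False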


* Expect and handle errors (expect the unexpected :-))

Badly coded content and links are common; expect that the markup passed to the
spider will not be perfect.
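
In code terms: wrap every fetch, log the failure, and keep going. A minimal
sketch (ExampleSpider is again a placeholder):

    import urllib.request, urllib.error

    def safe_fetch(url):
        try:
            req = urllib.request.Request(
                url, headers={"User-Agent": "ExampleSpider/1.0"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            print("HTTP %d for %s" % (e.code, url))
        except urllib.error.URLError as e:
            print("connection failed for %s: %s" % (url, e.reason))
        except (ValueError, UnicodeError) as e:
            print("malformed URL or response for %s: %s" % (url, e))
        return None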


* Beware of suspicious links

Check URLs carefully before following a link; check for fully qualified 
hostnames etc.   Of course spiders are always run off perfectly managed and 
secured machines -- not.

Example:

http://localhost/cgi-bin/phf?...
http://localhost/default.ida?...
http://proxy/
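
A rough sketch of such a check in Python (the rules are illustrative, not a
complete block list):

    import ipaddress
    from urllib.parse import urlsplit

    def looks_safe(url):
        parts = urlsplit(url)
        host = parts.hostname or ""
        if parts.scheme not in ("http", "https"):
            return False
        if "." not in host:        # not fully qualified: localhost, proxy, ...
            return False
        try:
            if not ipaddress.ip_address(host).is_global:
                return False       # loopback, private, link-local addresses
        except ValueError:
            pass                   # not a literal IP address -- fine
        return True

    assert not looks_safe("http://localhost/cgi-bin/phf?...")
    assert not looks_safe("http://proxy/")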




-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
