I've found that image maps, framesets, redirects, funky relative 
links, JavaScript links and dynamic URLs generated from backend 
systems are the main problems for robots.  The other big one is bad 
HTML that confuses the robot's parser, such as unclosed <li> tags.
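To make that concrete, here's a minimal sketch (in Python rather than the book's Perl, and with a made-up example URL) of a link extractor that resolves relative links against the page's base URL and uses the standard library's tolerant html.parser, which keeps going through sloppy markup like unclosed <li> tags instead of choking on it:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from a page.

    html.parser is event-driven and tolerant: an unclosed <li>
    just never fires handle_endtag, so bad HTML won't abort
    the parse or lose the links that follow it.
    """
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <a href> and image-map <area href> both carry links
        if tag in ("a", "area") and "href" in attrs:
            # urljoin turns funky relative links into absolute URLs
            self.links.append(urljoin(self.base_url, attrs["href"]))
        # framesets hide their real content behind <frame src>
        elif tag == "frame" and "src" in attrs:
            self.links.append(urljoin(self.base_url, attrs["src"]))

# Deliberately bad HTML: unclosed <li> tags, a "../" relative link
page = '<ul><li><a href="../a.html">A</li><li><area href="b.html">'
extractor = LinkExtractor("http://www.example.com/dir/page.html")
extractor.feed(page)
print(extractor.links)
# -> ['http://www.example.com/a.html', 'http://www.example.com/dir/b.html']
```

JavaScript links and backend-generated dynamic URLs are harder: there's no markup to parse, so a robot either skips them or needs special-case handling.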

I have written up a checklist for robot developers at 
<http://www.searchtools.com/robots/robot-checklist.html>; you may 
reprint it (if you give me credit) or include the URL to it in your 
book.

Avi

At 2:51 AM -0700 3/7/02, Sean M. Burke wrote:
>Hi all!
>My name is Sean Burke, and I'm writing a book for O'Reilly, which is to
>basically replace Clinton Wong's now out-of-print /Web Client
>Programming with Perl/.  In my book draft so far, I haven't discussed
>actual recursive spiders (I've only discussed getting a given page, and
>then every page that it links to which is also on the same host), since I
>think that most readers that think they want a recursive spider, really don't.
>But it has been suggested that I cover recursive spiders, just for sake of
>completeness.
>
>Aside from basic concepts (don't hammer the server; always obey the
>robots.txt; don't span hosts unless you are really sure that you want to),
>are there any particular bits of wisdom that list members would want me to
>pass on to my readers?
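
Those basics are easy to sketch.  Here's a minimal Python illustration 
(hypothetical host and URLs; a real robot would fetch robots.txt over 
HTTP with set_url()/read() rather than parse a canned copy) of obeying 
robots.txt, refusing to span hosts, and pausing so you don't hammer 
the server:

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

START_HOST = "www.example.com"  # hypothetical starting host

# Canned robots.txt so the sketch runs offline; a real robot
# would use rules.set_url(".../robots.txt") and rules.read().
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

def may_fetch(url, agent="ExampleBot"):
    # Don't span hosts unless you're really sure you want to,
    # and always obey robots.txt.
    return (urlsplit(url).hostname == START_HOST
            and rules.can_fetch(agent, url))

queue = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/secret.html",  # disallowed
    "http://other.example.net/index.html",         # off-host
]
allowed = [u for u in queue if may_fetch(u)]
for url in allowed:
    time.sleep(1)  # don't hammer the server between requests
print(allowed)
# -> ['http://www.example.com/index.html']
```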

-- 
Complete Guide to Search Engines for Web Sites and Intranets
    <http://www.searchtools.com>

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".