Aside from basic concepts (don't hammer the server; always obey the
robots.txt; don't span hosts unless you are really sure that you want to),
are there any particular bits of wisdom that list members would want me to
pass on to my readers?
Look at
Excellent. I have a copy of Wong's book at home and like the topic
(i.e., I'm a potential customer :)). When will it be published?
I think lots of people do want to know about recursive spiders, and I
bet some of the most frequent obstacles are issues like queueing and
depth-first vs. breadth-first traversal.
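Here is a minimal sketch of that queueing choice in Perl, assuming
LWP::UserAgent and HTML::LinkExtor are installed; the start URL and
agent string are invented for illustration. The same loop is
breadth-first with shift and depth-first with pop:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = URI->new('http://www.example.com/');  # hypothetical start URL
    my $ua    = LWP::UserAgent->new(agent => 'ExampleSpider/0.1');

    my @queue = ($start->as_string);
    my %seen  = ($start->as_string => 1);

    while (my $url = shift @queue) {   # shift is FIFO, i.e. breadth-first;
                                       # use pop here for depth-first
        my $resp = $ua->get($url);
        next unless $resp->is_success
                and $resp->content_type eq 'text/html';

        # Collect the hrefs of <a> tags; passing the response base to
        # LinkExtor makes every extracted link absolute.
        my @found;
        HTML::LinkExtor->new(
            sub {
                my ($tag, %attr) = @_;
                push @found, $attr{href} if $tag eq 'a' and $attr{href};
            },
            $resp->base,
        )->parse($resp->content);

        for my $link (@found) {
            my $u = URI->new($link)->canonical;
            next unless $u->scheme eq 'http';
            next unless $u->host eq $start->host;   # don't span hosts
            $u->fragment(undef);                    # ignore #fragment parts
            push @queue, $u->as_string unless $seen{ $u->as_string }++;
        }
    }

The %seen hash is what keeps a recursive spider out of link loops, and
the host check enforces the don't-span-hosts rule from the original
question. In real use you would substitute LWP::RobotUA for
LWP::UserAgent, since it obeys robots.txt and imposes a delay between
requests to the same host.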
Hi Sean,
You might want to consider exploring the not-yet-approved updated
robots.txt standard, which covers Allow rules, and how to apply them in
your spider. That may help raise awareness of the robots.txt standard.
You could also talk about how to use robots.txt with your spider.
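For what it's worth, the 1996 draft ("A Method for Web Robots Control")
evaluates Allow and Disallow lines in the order they occur in the
record, and the first path prefix that matches decides; if nothing
matches, the URL is allowed. Here is a toy Perl evaluator for a single,
already-matched User-agent record (the rules are invented for
illustration):

    # Rules from a hypothetical record, kept in file order:
    #   User-agent: *
    #   Allow: /public/
    #   Disallow: /
    my @rules = (
        [ allow    => '/public/' ],
        [ disallow => '/'        ],
    );

    sub path_allowed {
        my ($path) = @_;
        for my $rule (@rules) {
            my ($verb, $prefix) = @$rule;
            # Draft semantics: the first matching prefix decides.
            return $verb eq 'allow' if index($path, $prefix) == 0;
        }
        return 1;    # no rule matched, so access is allowed
    }

    print path_allowed('/public/faq.html') ? "fetch\n" : "skip\n";   # fetch
    print path_allowed('/cgi-bin/search')  ? "fetch\n" : "skip\n";   # skip

As far as I know, LWP's bundled WWW::RobotRules implements only the
original Disallow-only spec, which is part of why the draft deserves
the attention.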
That's a curious remark about readers and their misplaced desire for
recursive spiders.
A recursive spider allows its user to drill down into a particular
information domain and ultimately exhaust it, if the spider is capable
enough. This is of enormous benefit to the information researcher.
I've found that image maps, framesets, redirects, funky relative
links, JavaScript links, and dynamic URLs generated from backend
systems are the main problems for robots. Bad HTML is another: the
robot's parser gets confused by things like unclosed li tags.
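One defensive sketch against several of those, assuming $resp is an
HTTP::Response from an earlier fetch: HTML::TreeBuilder keeps parsing
through bad markup such as unclosed li tags, $resp->base folds in
redirects and any <base href> so funky relative links resolve
correctly, and image-map <area> and frameset <frame> links get
collected alongside plain <a> tags:

    use HTML::TreeBuilder;
    use URI;

    # The response base reflects redirects and any <base href> tag,
    # so relative links resolve against the right URL.
    my $base = $resp->base;

    # TreeBuilder builds a tree even from sloppy HTML (unclosed <li>
    # tags and the like) instead of getting confused.
    my $tree = HTML::TreeBuilder->new_from_content($resp->content);

    my @links;
    for my $elem ($tree->look_down(
        sub { $_[0]->tag =~ /^(?:a|area|frame|iframe)$/ }
    )) {
        my $href = $elem->attr('href') || $elem->attr('src');
        push @links, URI->new_abs($href, $base)->canonical
            if defined $href;
    }
    $tree->delete;    # HTML::Element trees must be freed explicitly

JavaScript links and dynamic URLs generated on the backend are still
invisible to any pure HTML parse, so those belong on a checklist as
things a robot simply cannot follow reliably.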
I have written up a checklist