I like Rob's idea of an exclusion list specifically formatted for
interesting sites.  Shades of SiteScooper!

I've been of the opinion that fetching a set of pages from various
web-sites is not the most interesting part of the parser, and that
such magic more properly belongs in a system like SiteScooper, which
would provide a set of pages for Plucker to operate on.

But I'd like to suggest that, if we continue to frobify the fetching
like this, we move that part of the parser logic to a separate
file and set of classes, and take it out of the Spider.py code.
That would make another thing easier too: the recursive parsing of
pages, which is necessary for implementing the <OBJECT> tag.
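
To make the split concrete, here is a rough sketch of what I have in
mind. None of these names exist in Spider.py today; ExclusionList,
Fetcher, gather and the find_links callable are all made up for
illustration, and the real exclusion-list format is still to be decided
(I just assume regular expressions here).

```python
import re


class ExclusionList:
    """Hypothetical helper: decides whether a URL should be skipped."""

    def __init__(self, patterns):
        # Assumption: each entry is a regular expression that is
        # searched for anywhere in the URL.
        self._patterns = [re.compile(p) for p in patterns]

    def excludes(self, url):
        return any(p.search(url) for p in self._patterns)


class Fetcher:
    """Hypothetical fetching class, kept apart from the parsing code."""

    def __init__(self, retrieve, exclusions):
        # `retrieve` is any callable mapping a URL to page text, so the
        # network layer can be swapped out (or faked in a test).
        self._retrieve = retrieve
        self._exclusions = exclusions

    def fetch(self, url):
        # Excluded URLs are never retrieved; None means "skipped".
        if self._exclusions.excludes(url):
            return None
        return self._retrieve(url)


def gather(fetcher, find_links, start_url, max_depth=2):
    """Collect pages reachable from start_url, following links recursively.

    `find_links` is a hypothetical callable that returns the URLs
    referenced by a page's text.
    """
    pages = {}

    def visit(url, depth):
        if url in pages or depth > max_depth:
            return
        text = fetcher.fetch(url)
        if text is None:
            return
        pages[url] = text
        for link in find_links(text):
            visit(link, depth + 1)

    visit(start_url, 0)
    return pages
```

With fetching isolated like this, the parser only ever sees the
resulting set of pages, and the same recursive visit could presumably
be reused to pull in whatever an <OBJECT> tag refers to.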

Bill

