> I like Rob's idea of an exclusion list specifically formatted for
> interesting sites.  Shades of SiteScooper!

        To add to this, how about parsing the -f argument into a
.$filename[rc] file? Then if I decide to fetch Slashdot, I can use -f
Slashdot, and ~/.plucker/.Slashdot.ex (or some other name) gets used by
default, sort of like a style sheet. Yes, I could pass -E explicitly, but
this should be somewhat automagic. Just an idea.
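The lookup above could be sketched roughly like this; the PLUCKERHOME
location, the ".ex" suffix, and the function name are all my own
assumptions for illustration, not actual Plucker code:

```python
import os.path

# Assumed default config directory -- not necessarily what Plucker uses.
PLUCKERHOME = os.path.expanduser("~/.plucker")

def exclusion_file_for(doc_name, explicit=None, base=PLUCKERHOME):
    """Return the exclusion-list path for a given -f document name.

    An explicit -E path always wins; otherwise look for
    <base>/.<doc_name>.ex and use it only if it exists.
    """
    if explicit:
        return explicit
    candidate = os.path.join(base, ".%s.ex" % doc_name)
    if os.path.exists(candidate):
        return candidate
    return None

# e.g. "-f Slashdot" would automagically pick up ~/.plucker/.Slashdot.ex
```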

> I've been of the opinion that fetching a set of pages from various
> web-sites is not the most interesting part of the parser, and that such
> magic more properly belongs in a system like SiteScooper, which would
> provide a set of pages for Plucker to operate on.

        I've been suggesting for years now that we use pavuk to replace the
parser's fetching. It does quite a bit more than any other web-fetching
application I've seen to date: Javascript (which, as you know, is not
really "parsed", since it has to be executed to run), cookies, https,
cache management, and all kinds of other advanced features.

> But I'd like to suggest that if we continue to frobify the fetching like
> this that we move that part of the parser logic to a separate file and
> set of classes, and take it out of the Spider.py code. There's another
> thing that makes easier too, the recursive parsing of pages, necessary
> for implementing the <OBJECT> tag.

        If you do this, then we can subclass the parser to handle not only
HTML and text input, but also conversion to and from other formats, as
well as grabbing RSS and other syndicated feeds, which would extend the
functionality greatly.
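A rough sketch of what that class hierarchy might look like once the
logic is out of Spider.py; every class, method, and MIME type here is
invented for illustration, not an actual Plucker API:

```python
import re

class BaseParser:
    """Turn one fetched resource into text plus links to recurse into."""
    def parse(self, data):
        raise NotImplementedError

class TextParser(BaseParser):
    def parse(self, data):
        return data, []          # plain text: nothing further to follow

class RSSParser(BaseParser):
    def parse(self, data):
        # A real version would use an XML parser; this regex just pulls
        # out <link> elements to show where the recursion hook lives.
        links = re.findall(r"<link>(.*?)</link>", data)
        return data, links

def parser_for(content_type):
    # Dispatch on MIME type, so a new input format is one subclass away.
    table = {"text/plain": TextParser, "application/rss+xml": RSSParser}
    return table.get(content_type, TextParser)()
```

The point is the dispatch table: the spider only ever calls parse() and
follows whatever links come back, so feed support never touches the
fetching code.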

        I'm still partial to the language, but there are many of them to
deal with.

        At least it's not lisp! <flame suit on>

[dd]