> I am looking for a spider/gatherer with the following characteristics:
>     * Enables the control of the crawling process by URL substring/regexp
> and HTML context of the link.
>     * Enables the control of the gathering (i.e. saving) process by URL
> substring/regexp, MIME type, other header information and ideally by some
> predicates on the HTML source.
>     * Some way to save page/document metadata, ideally in a database.
>     * Freeware, shareware or otherwise inexpensive would be nice.

You might like to take a look at Harvest-NG
(http://webharvest.sourceforge.net/ng), which is free software. It will
do all of what you detail above. It saves the metadata in a Perl DBM
database; some work has been done, though not yet completed, on using
the DBI interface to store it in a remote database instead. You may find
some knowledge of Perl helpful in adapting it exactly to your needs
(much use is made of Perl regular expressions in the pattern matching,
for instance).
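
I don't have an exact Harvest-NG configuration to hand, so the following
is only a rough Perl sketch of the general flavour, not Harvest-NG's own
interface: filter the crawl by a URL regexp, filter the gather by MIME
type taken from the headers, and tie a DBM file for the metadata. The
URL pattern, MIME list and file name here are made up for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;
    use LWP::UserAgent;

    # Hypothetical filters -- substitute your own rules.
    my $url_allow  = qr{^http://www\.example\.org/docs/};   # control crawling by URL regexp
    my $mime_allow = qr{^(?:text/html|application/pdf)\b};  # control gathering by MIME type

    # Metadata keyed by URL, kept in a local DBM file.
    tie my %meta, 'SDBM_File', 'gathered_meta', O_RDWR | O_CREAT, 0644
        or die "cannot open DBM file: $!";

    my $ua = LWP::UserAgent->new(timeout => 30);

    for my $url (@ARGV) {
        next unless $url =~ $url_allow;            # skip URLs outside the crawl space

        my $head = $ua->head($url);                # headers only, no body yet
        next unless $head->is_success;

        my $type = $head->header('Content-Type') || '';
        next unless $type =~ $mime_allow;          # skip unwanted MIME types

        # Record a little metadata for the gathered document.
        $meta{$url} = join '|', $type,
                                ($head->header('Last-Modified')  || ''),
                                ($head->header('Content-Length') || '');
        print "gathered: $url ($type)\n";
    }

    untie %meta;

Swapping the tied DBM for DBI calls against a remote database is the
same idea, just with an INSERT in place of the hash assignment.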

Cheers,

Simon.
