Look into Pavuk, webBase, wget, or Larbin. They might satisfy most
(though not all) of the requirements mentioned here.
Hope it helps,
Krishna Jha
Bhasha Inc
PS: Actually, this is a repeat of an earlier posting; Nick, is it time
for an InternetRobots FAQ here?
Jim MacDiarmid wrote:
Is there anything like this that would run on a Windows 98 or NT platform?
Jim MacDiarmid, Senior Software Engineer
PACEL Corp.
8870 Rixlew Lane
Manassas, VA 20109
(703) 257-4759
FAX: (703) 361-6706
www.pacel.com
-----Original Message-----
From: Simon Wilkinson [SMTP:[EMAIL PROTECTED]]
Sent: Sunday, January 14, 2001 4:37 PM
To: [EMAIL PROTECTED]
Subject: Re: Looking for a gatherer.
I am looking for a spider/gatherer with the following characteristics:
* Enables control of the crawling process by URL substring/regexp and
by the HTML context of the link.
* Enables control of the gathering (i.e. saving) process by URL
substring/regexp, MIME type, other header information, and ideally by
some predicates on the HTML source.
* Some way to save page/document metadata, ideally in a database.
* Freeware, shareware, or otherwise inexpensive would be nice.
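For concreteness, the first two requirements amount to something like
the following minimal Perl sketch. The patterns and MIME types here are
invented purely for illustration; none of this is taken from the tools
named in this thread.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented example patterns; substitute your own.
    my $crawl_allow  = qr{^http://www\.example\.com/};  # follow only these links
    my $gather_deny  = qr{\.(?:gif|jpe?g|png)$}i;       # never save images
    my %gather_types = ('text/html' => 1, 'application/pdf' => 1);

    # Should this link be followed at all?
    sub should_crawl {
        my ($url) = @_;
        return $url =~ $crawl_allow;
    }

    # Should this fetched document be saved, given its URL and its
    # Content-Type header?
    sub should_gather {
        my ($url, $mime) = @_;
        return 0 if $url =~ $gather_deny;
        return exists $gather_types{$mime};
    }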
You might like to take a look at Harvest-NG
(http://webharvest.sourceforge.net/ng), which is free software. It will
do all of what you detail above. It saves the metadata in a Perl DBM
database; some work has been done, though not completed, on using the
DBI interface to a remote database. You may find some knowledge of Perl
helpful in adapting it exactly to your needs (much use is made of Perl
regular expressions in the pattern matching, for instance).
Cheers,
Simon.