Re: [Nutch-dev] content archiving

Gordon Mohr (Internet Archive) Thu, 03 Jun 2004 14:24:36 -0700

Stefan Groschupf wrote:

Hi,
I had a talk with some guys from the library who are interested in using Nutch to set up a regional internet archive.


You might also want to encourage them to take a look at Heritrix, the
open-source Java crawler from the Internet Archive, which is used (or
planned for use) by a number of major libraries.

The emphasis with Heritrix is on archival rather than text indexing/search,
so it differs from Nutch in several ways:

  - stores exact copies of HTTP responses in aggregate 'ARC' files
  - by default, fetches and stores all document types, of all lengths
  - doesn't use the 'rounds'/batching approach

Hopefully Nutch and Heritrix crawling efforts will be able to share
code, results, and solutions to thorny crawling problems as time goes on.

For more info, see:

    http://crawler.archive.org

- Gordon @ IA

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
