Stefan Groschupf wrote:
Hi,

I had a talk with some people from the library who are interested in using Nutch to set up a regional internet archive.

You might also want to encourage them to take a look at Heritrix, the open-source Java crawler from the Internet Archive, which is used (or planned for use) by a number of major libraries.

The emphasis with Heritrix is on archival rather than text indexing/search,
so it differs from Nutch in several ways:

  - stores exact copies of HTTP responses in aggregate 'ARC' files
  - by default, fetches and stores all document types, of all lengths
  - doesn't use the 'rounds'/batching approach
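
To make the first point concrete, here is a minimal sketch of the general idea of appending untouched HTTP responses to one aggregate file. It is not Heritrix code and not the exact ARC layout: the one-line header (URL, IP, timestamp, content type, length) is only loosely modeled on ARC, the function names `append_record` and `read_record` are hypothetical, and it assumes header fields contain no spaces.

```python
import io
from datetime import datetime, timezone


def append_record(out, url, ip, content_type, raw_response, when=None):
    """Append one record to a seekable binary stream; return its byte offset.

    Simplified, ARC-like layout (an assumption, not the real spec):
    a space-separated header line, the raw response bytes, then a newline.
    """
    when = when or datetime.now(timezone.utc)
    offset = out.seek(0, io.SEEK_END)  # always write at the end of the aggregate
    header = "{} {} {} {} {}\n".format(
        url, ip, when.strftime("%Y%m%d%H%M%S"), content_type, len(raw_response)
    )
    out.write(header.encode("utf-8"))
    out.write(raw_response)  # stored byte-for-byte, whatever the document type
    out.write(b"\n")
    return offset


def read_record(stream, offset):
    """Read back the record starting at `offset`: (url, content_type, raw bytes)."""
    stream.seek(offset)
    header = stream.readline().decode("utf-8").rstrip("\n")
    url, _ip, _stamp, content_type, length = header.split(" ")
    body = stream.read(int(length))  # length-prefixed, so binary bodies are safe
    return url, content_type, body
```

Because each record carries its own length, HTML and binary documents can sit side by side in one file, and a separate index of offsets is enough to retrieve any stored response exactly as it was received.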

Hopefully Nutch and Heritrix crawling efforts will be able to share
code, results, and solutions to thorny crawling problems as time goes on.

For more info, see:

    http://crawler.archive.org

- Gordon @ IA



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
