Stefan Groschupf wrote:
Hi,
I had a talk with some guys from the library who are interested in using
Nutch to set up a regional internet archive.
You might also want to encourage them to take a look at Heritrix, the
open-source java crawler from the Internet Archive, and used (or planned
for use) by a number of major libraries.
The emphasis with Heritrix is on archival rather than text indexing/search,
so it differs from Nutch in several ways:
- stores exact copies of HTTP responses in aggregate 'ARC' files
- by default, fetches and stores all document types, of all lengths
- doesn't use the 'rounds'/batching approach
Hopefully Nutch and Heritrix crawling efforts will be able to share
code, results, and solutions to thorny crawling problems as time goes on.
For more info, see:
http://crawler.archive.org
- Gordon @ IA
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers