The primary aim of Heritrix is to be an "archival crawler" -- obtaining complete, accurate, deep copies of websites. This includes getting graphical and other non-textual content. Resources are stored exactly as they were received -- no truncation, encoding changes, header changes, etc.
Recrawls of the same URLs do not replace prior crawls in any sort of running page database.
The focus has usually been dozens to hundreds of chosen websites, but has begun to shift to tens of thousands or hundreds of thousands of websites (entire national domains).
Crawls are launched, monitored, and adjusted via a (fairly complex) web user interface, allowing flexible (and sometimes downright idiosyncratic) definition of what URLs should be visited and which should not.
My understanding is that with the Nutch crawler's alternate aims, some of the things it does differently are:
- only retrieves and saves indexable content
- may truncate or format-shift content as needed
- saves content into a database format optimized for later indexing; refetches replace older fetches
- is run and controlled from the command line
- emphasizes volume of collection under default conditions, rather than exact satisfaction of custom parameters
I'm not up-to-date on the Nutch crawler, so I could be missing important features or distinctions.
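To make the storage difference above concrete, here is a rough Python sketch of the two policies. It is only an illustration: the class names, the `save` method, and the `max_bytes` limit are all made up for this example and are not taken from either crawler's code or on-disk format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ArchivalStore:
    """Hypothetical append-only store: every fetch is kept exactly as received."""
    records: List[Tuple[str, float, bytes]] = field(default_factory=list)

    def save(self, url: str, fetched_at: float, raw_response: bytes) -> None:
        # A recrawl adds a new record; earlier captures are never touched.
        self.records.append((url, fetched_at, raw_response))

    def versions(self, url: str) -> List[bytes]:
        # All captures of one URL, in the order they were saved.
        return [raw for (u, _, raw) in self.records if u == url]


@dataclass
class IndexingStore:
    """Hypothetical replace-on-refetch store: one current entry per URL."""
    max_bytes: int = 65536  # assumed truncation limit, for illustration only
    pages: Dict[str, bytes] = field(default_factory=dict)

    def save(self, url: str, fetched_at: float, raw_response: bytes) -> None:
        # Content may be truncated, and a refetch overwrites the old entry.
        self.pages[url] = raw_response[: self.max_bytes]
```

Saving the same URL twice leaves two records in the first store and a single, possibly truncated, entry in the second.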
It'd be nice to converge parts of the crawlers' architectures, for example to share link-extraction or trap-detection techniques.
- Gordon @ Internet Archive
Nutch Crawler wrote:
Hello,
Can someone please contrast and compare the Heritrix crawler with the nutch crawler. What are the advantages/disadvantages to using one over the other?
Thanks, Ralph
------------------------------------------------------- The SF.Net email is sponsored by: Beat the post-holiday blues Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt _______________________________________________ Nutch-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-developers
