The primary aim of Heritrix is to be an "archival crawler" --
obtaining complete, accurate, deep copies of websites. This
includes getting graphical and other non-textual content.
Resources are stored exactly as they were received -- no
truncation, encoding changes, header changes, etc.

Recrawls of the same URLs do not replace prior crawls in any
sort of running page database.
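
To make the storage difference concrete, here is a minimal
sketch of the two models. The class and method names are
hypothetical and purely illustrative; neither is taken from
Heritrix or Nutch, and real crawlers write to files on disk
rather than in-memory dicts.

```python
class ArchivalStore:
    """Append-only: every fetch of a URL is kept as its own record."""

    def __init__(self):
        self._records = {}  # url -> list of (fetched_at, raw_bytes)

    def save(self, url, fetched_at, raw_bytes):
        # A recrawl adds a new record; earlier captures are untouched.
        self._records.setdefault(url, []).append((fetched_at, raw_bytes))

    def versions(self, url):
        return list(self._records.get(url, []))


class PageDatabase:
    """Replace-on-refetch: only the most recent fetch is kept."""

    def __init__(self):
        self._pages = {}  # url -> (fetched_at, raw_bytes)

    def save(self, url, fetched_at, raw_bytes):
        # A recrawl overwrites whatever was stored for this URL.
        self._pages[url] = (fetched_at, raw_bytes)

    def versions(self, url):
        return [self._pages[url]] if url in self._pages else []
```

After saving the same URL twice, the first store would hold
both captures and the second only the later one.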

The focus has usually been on dozens to hundreds of chosen
websites, but it has begun to shift to tens of thousands or
hundreds of thousands of websites (entire national domains).

Crawls are launched, monitored, and adjusted via a (fairly
complex) web user interface, allowing flexible (and sometimes
downright idiosyncratic) definition of what URLs should be
visited and which should not.
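
As a rough illustration of what such a scope definition boils
down to, the sketch below applies an ordered list of
accept/reject patterns to a URL, with the last matching rule
winning. This is an assumed, simplified model and not
Heritrix's actual configuration format; the function name,
rule layout, and example hosts are all hypothetical.

```python
import re

ACCEPT, REJECT = "accept", "reject"

# Hypothetical rules, evaluated in order; a later match overrides
# an earlier one.
RULES = [
    # Accept everything on the chosen site and its subdomains.
    (r"^https?://([^/]+\.)?example\.org(/|$)", ACCEPT),
    # Reject a path that tends to generate endless URLs.
    (r"/calendar/", REJECT),
    # Accept images wherever they live, so pages stay complete.
    (r"\.(png|jpe?g|gif)$", ACCEPT),
]


def in_scope(url, rules=RULES, default=REJECT):
    """Return True if the last rule matching `url` says ACCEPT."""
    decision = default
    for pattern, verdict in rules:
        if re.search(pattern, url):
            decision = verdict
    return decision == ACCEPT
```

With these rules, an ordinary page on example.org is visited,
its calendar pages are skipped, and an image hosted elsewhere
is still fetched.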

My understanding is that the Nutch crawler has different
aims, and so does some things differently. It:
 - retrieves and saves only indexable content
 - may truncate or format-shift content as needed
 - saves content into a database format optimized for
   later indexing; refetches replace older fetches
 - is run and controlled from the command line
 - emphasizes volume of collection under default conditions,
   rather than exact satisfaction of custom parameters

I'm not up to date on the Nutch crawler, so I could be missing
important features or distinctions.

It'd be nice to converge parts of the crawlers' architectures,
for example to share link-extraction or trap-detection
techniques.

- Gordon @ Internet Archive

Nutch Crawler wrote:
Hello,

Can someone please compare and contrast the Heritrix crawler
with the Nutch crawler? What are the advantages/disadvantages
of using one over the other?

Thanks,
Ralph


------------------------------------------------------- The SF.Net email is sponsored by: Beat the post-holiday blues Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt _______________________________________________ Nutch-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-developers


