Aaron Binns wrote:

Our usage of Nutch is focused on index building and search services.  We
don't use the crawling/fetching features at all.  We use Heritrix.
Typically, our large-scale harvests are performed over 8-12 week
periods, then the archived data is handed off to me for full-text search
indexing.  We deploy the indexes on a separate rack of machines
dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure.  When
indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
system helps tremendously.

Are you familiar with the distributed indexing package in Hadoop contrib/?


However, one of the biggest challenges to using Nutch is the fact
that the URL is used as the unique key for a document.  This is usually
a sensible thing to do, but for web archives, it doesn't work.  Our
NutchWAX package contains all sorts of hacks to work around this
assumption.

Indeed, this change is something that I've been considering, too - URL == page doesn't work that well in the case of archives, but also when your unit of information is smaller (pagelet) or larger (compound docs) than a page.

People can help with this by working on a patch that replaces this silent assumption with an explicit API, i.e. splitting recordId and URL into separate fields.
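
To illustrate the kind of split meant here, a minimal sketch follows. It is not existing Nutch or NutchWAX code: the class names RecordKey and RecordStore, and the "URL plus capture timestamp" id scheme, are assumptions made up for this example. The point is only that identity comes from recordId, while the URL becomes an ordinary field that several records may share.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical key type (not a Nutch class): identity and location are separate.
final class RecordKey {
    private final String recordId; // unique identity of the stored record
    private final String url;      // where it was fetched from; may repeat

    RecordKey(String recordId, String url) {
        this.recordId = Objects.requireNonNull(recordId);
        this.url = Objects.requireNonNull(url);
    }

    // Assumed id scheme for web archives: URL plus capture timestamp.
    static RecordKey forCapture(String url, String captureTimestamp) {
        return new RecordKey(url + "@" + captureTimestamp, url);
    }

    String recordId() { return recordId; }
    String url() { return url; }

    // Equality is defined by recordId only, never by URL.
    @Override public boolean equals(Object o) {
        return o instanceof RecordKey && ((RecordKey) o).recordId.equals(recordId);
    }

    @Override public int hashCode() { return recordId.hashCode(); }
}

// Hypothetical in-memory store standing in for an index keyed by recordId.
final class RecordStore {
    private final Map<RecordKey, String> contentByKey = new LinkedHashMap<>();

    void put(RecordKey key, String content) { contentByKey.put(key, content); }

    int size() { return contentByKey.size(); }

    // All records that share one URL, e.g. successive captures of a page.
    List<String> recordIdsForUrl(String url) {
        List<String> ids = new ArrayList<>();
        for (RecordKey key : contentByKey.keySet()) {
            if (key.url().equals(url)) {
                ids.add(key.recordId());
            }
        }
        return ids;
    }
}
```

With a key like this, two captures of the same URL taken at different times are stored as two records instead of the later one overwriting the earlier one, which is what happens when the URL itself is the key.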



As for the future of Nutch, I am concerned over what I see to be an
increasing focus on crawling and fetching.  We have only lightly
evaluated other Open Source search projects, such as Solr, and are not
convinced any can be a drop-in replacement for Nutch.  Solr certainly
has some nice features, but I'm just not convinced it can
scale up to the billion-document level.

What do you see as the unique strength of Nutch, then? IMHO there are existing frameworks for distributed indexing (on Hadoop) and distributed search (e.g. Katta). We would like to avoid the duplication of effort, and to focus instead on the aspects of Nutch functionality that are not available elsewhere.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
