The Future of Nutch, reactivated

Andrzej Bialecki Thu, 14 May 2009 06:46:22 -0700

Hi all,

I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying here to express it as it looks from my point of view.


Target audience
===============

I think that the Nutch project is experiencing an identity crisis now - we are not sure who our target audience is, and we cannot satisfy everyone. I think that there are the following groups of Nutch users:


1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is on the ease of use out of the box, and an often requested feature is a GUI frontend. Currently IMHO Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.

What is the target audience that we as a community want to support? By this I mean not only the moral support, but also active participation in the development process. From the place where we are at the moment we could go in any of the above directions.


Core competence
===============

This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGi or some other simple IoC container - there are probably other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.

* web graph analysis - this includes link-based ranking and mirror detection (and URL "aliasing"), but also link spam detection and more complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.


-------

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than one direction at a time ... ;)

* we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd like to wrap it up in a concise mission statement that would help us set the goals for the next couple of months.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
