Hi all,

I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below has been said by others before; here I'm trying to express it as it looks from my point of view.

Target audience
===============
I think that the Nutch project is going through an identity crisis now - we are not sure who our target audience is, and we cannot satisfy everyone. I think that there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, and the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as real-time indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.

4. Single desktop to small intranet search: as above, but the emphasis is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently IMHO Nutch is too complex and requires too much command-line operation for this use case to be attractive to casual users.

What is the target audience that we as a community want to support? By this I mean not only the moral support, but also active participation in the development process. From the place where we are at the moment we could go in any of the above directions.

Core competence
===============
This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries: OSGi or some other simple IoC container for the former, and Tika for the latter. There are probably other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.

* web graph analysis - this includes link-based ranking and mirror detection (and URL "aliasing"), but also link spam detection and more complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.

-------

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than one direction at a time ... ;)

* we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.

Feel free to comment on the above, and to make suggestions or corrections. I'd like to wrap it up in a concise mission statement that would help us set the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

