Hello,

(I saw that the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow up on.)
I agree with the list of core competencies. I don't remember where I said or wrote this, but I know I've said it a few times before: I think Solr is the future of Nutch's search. I have a feeling the original Nutch search components will die off with time -- nobody is working on them, and Solr is making great progress.

In my experience, most Nutch users fall under #2. Most require web-wide crawling, but really care about a specific vertical slice. So that's where I'd say the focus should be, theoretically. I say theoretically because I don't think active Nutch developers can really choose a direction if it doesn't match their own itches.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Andrzej Bialecki <a...@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, May 14, 2009 9:59:11 AM
> Subject: The Future of Nutch, reactivated
>
> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before; I'm trying here to express it as it looks from my point of
> view.
>
> Target audience
> ===============
> I think that the Nutch project is experiencing an identity crisis right
> now -- we are not sure who the target audience is, and we cannot satisfy
> everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: there are actually only a few
> such users, because it takes considerable resources to manage operations
> at that scale. Scalability, manageability and ranking/spam prevention
> are the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall
> into this category. Modularity, flexibility in implementing custom
> processing, and the ability to modify workflows and to use only some
> Nutch components seem to be the chief concerns here. Scalability too,
> but only up to a volume of ~100-200 million documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number
> of Nutch users that fall into this category, for historical reasons.
> Link-based ranking and resource discovery are not that important here,
> but integration with Windows networking, Microsoft formats and
> databases, as well as real-time indexing and easy index maintenance,
> are crucial. This class of users often has to heavily customize Nutch
> to get any sensible result. Also, this is where Solr really shines, so
> there is little benefit in using Nutch here. I predict that Nutch will
> have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the accent is
> on ease of use out of the box, and an often-requested feature is a GUI
> frontend. Currently, IMHO, Nutch is too complex and requires too much
> command-line operation for casual users to find this use case
> attractive.
>
> What is the target audience that we as a community want to support? By
> this I mean not only moral support, but also active participation in
> the development process. From where we are at the moment we could go in
> any of the above directions.
>
> Core competence
> ===============
> This is a simple but important point. Currently we maintain several
> major subsystems in Nutch that are implemented by other projects, and
> often in a better way. The plugin framework (and dependency injection)
> and content parsing are two areas that we should delegate to
> third-party libraries, such as Tika and OSGi or some other simple IoC
> container -- there are probably other components that we don't have to
> build ourselves. Another thing that I'd love to delegate is distributed
> search and index maintenance -- either through Solr or Katta or
> something else.
>
> The question then is: what is the core competence of this project?
> I see the following major areas that are unique to Nutch:
>
> * crawling -- this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and
> netiquette constraints, etc.
>
> * web graph analysis -- this includes link-based ranking and mirror
> detection (and URL "aliasing"), but also link spam detection and more
> sophisticated control over the crawling frontier.
>
> Anything more? I'm not sure -- perhaps I would add template detection
> and pagelet-level crawling (i.e. sensible re-crawling of portal-type
> sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and the pluggable FetchSchedule and Signature. A lot
> remains to be done here, and we are still spending a lot of resources
> on dealing with issues outside this core competence.
>
> -------
>
> So, what do we need to do next?
>
> * We need to decide where we should commit our resources, as a
> community of users, contributors and committers, so that the project is
> most useful to our target audience. At this point there are few active
> committers, so I don't think we can cover more than one direction at a
> time ... ;)
>
> * We need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
>
> Feel free to comment on the above, and make suggestions or corrections.
> I'd like to wrap it up in a concise mission statement that would help
> us set the goals for the next couple of months.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
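[Editor's note: the pluggable FetchSchedule that Andrzej mentions is the hook for re-crawl scheduling. The sketch below illustrates the general idea of an adaptive schedule -- shrink the re-fetch interval when a page changed since the last fetch, grow it when it did not, clamped to sane bounds. This is a standalone illustration only: the class name, method names, and rate constants are made up for this example and are not the actual Nutch 1.0 API.]

```java
// Hypothetical sketch of adaptive re-crawl scheduling, in the spirit of a
// pluggable fetch schedule. All names and constants here are illustrative.
public class AdaptiveScheduleSketch {
    static final long MIN_INTERVAL = 60L * 60;           // 1 hour, in seconds
    static final long MAX_INTERVAL = 30L * 24 * 60 * 60; // 30 days
    static final double INC_RATE = 0.4; // back off when the page is unchanged
    static final double DEC_RATE = 0.2; // fetch sooner when the page changed

    /**
     * Compute the next fetch interval (seconds) from the previous interval
     * and whether the page content changed since the last fetch.
     */
    static long nextInterval(long prevInterval, boolean changed) {
        double next = changed
                ? prevInterval * (1.0 - DEC_RATE)  // page changed: shrink interval
                : prevInterval * (1.0 + INC_RATE); // page unchanged: grow interval
        // Clamp so we neither hammer a host nor forget a page entirely.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, (long) next));
    }

    public static void main(String[] args) {
        long interval = 24L * 60 * 60;            // start with one day
        interval = nextInterval(interval, false); // unchanged -> back off
        System.out.println(interval);             // 120960 (1.4 days)
        interval = nextInterval(interval, true);  // changed -> fetch sooner
        System.out.println(interval);             // 96768
    }
}
```

The appeal of making this pluggable is that a vertical-search deployment (audience #2 above) can swap in a schedule tuned to its sites without touching the rest of the crawl loop.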