Hello,

(I saw that the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow up on.)
I agree with the list of core competencies. I don't remember where I said or wrote this, but I know I've said it a few times before: I think Solr is the future of Nutch's search. I have a feeling the original Nutch search components will die off with time -- nobody is working on them, and Solr is making great progress.

In my experience, most Nutch users fall under #2. Most require web-wide crawling, but really care about a specific vertical slice. So that's where I'd say the focus should be, theoretically. I say theoretically because I don't think active Nutch developers can really choose a direction if it doesn't match their own itches.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Andrzej Bialecki <a...@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, May 14, 2009 9:59:11 AM
> Subject: The Future of Nutch, reactivated
>
> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before; I'm trying here to express it as it looks from my point of
> view.
>
> Target audience
> ===============
> I think that the Nutch project is experiencing an identity crisis right
> now -- we are not sure who the target audience is, and we cannot satisfy
> everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: there are actually only a few
> such users, because it takes considerable resources to manage operations
> at that scale. Scalability, manageability and ranking/spam prevention
> are the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall
> into this category. Modularity, flexibility in implementing custom
> processing, and the ability to modify workflows and to use only some
> Nutch components seem to be the chief concerns here. Scalability too,
> but only up to a volume of ~100-200 million documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number
> of Nutch users that fall into this category, for historical reasons.
> Link-based ranking and resource discovery are not that important here,
> but integration with Windows networking, Microsoft formats and
> databases, as well as real-time indexing and easy index maintenance,
> are crucial. This class of users often has to heavily customize Nutch
> to get any sensible result. Also, this is where Solr really shines, so
> there is little benefit in using Nutch here. I predict that Nutch will
> have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the accent is
> on ease of use out of the box, and an often-requested feature is a GUI
> frontend. Currently, IMHO, Nutch is too complex and requires too much
> command-line operation for casual users to find this use case
> attractive.
>
> What is the target audience that we as a community want to support? By
> this I mean not only moral support, but also active participation in
> the development process. From where we are at the moment we could go in
> any of the above directions.
>
> Core competence
> ===============
> This is a simple but important point. Currently we maintain several
> major subsystems in Nutch that are implemented by other projects, and
> often in a better way. The plugin framework (and dependency injection)
> and content parsing are two areas that we should delegate to
> third-party libraries, such as Tika and OSGi or some other simple IoC
> container -- there are probably other components that we don't have to
> build ourselves. Another thing that I'd love to delegate is distributed
> search and index maintenance -- either through Solr or Katta or
> something else.
>
> The question then is: what is the core competence of this project?
> I see the following major areas that are unique to Nutch:
>
> * crawling -- this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and
> netiquette constraints, etc.
>
> * web graph analysis -- this includes link-based ranking and mirror
> detection (and URL "aliasing"), but also link spam detection and more
> sophisticated control over the crawling frontier.
>
> Anything more? I'm not sure -- perhaps I would add template detection
> and pagelet-level crawling (i.e. sensible re-crawling of portal-type
> sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and the pluggable FetchSchedule and Signature. A lot
> remains to be done here, and we are still spending a lot of resources
> on dealing with issues outside this core competence.
>
> -------
>
> So, what do we need to do next?
>
> * We need to decide where we should commit our resources, as a
> community of users, contributors and committers, so that the project is
> most useful to our target audience. At this point there are few active
> committers, so I don't think we can cover more than one direction at a
> time ... ;)
>
> * We need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
>
> Feel free to comment on the above, and make suggestions or corrections.
> I'd like to wrap it up in a concise mission statement that would help
> us set the goals for the next couple of months.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
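[Editor's note: the pluggable FetchSchedule that Andrzej mentions is the hook for re-crawl scheduling. The sketch below illustrates the general idea of an adaptive schedule -- shrink the re-fetch interval when a page changed since the last fetch, grow it when it did not, clamped to sane bounds. This is a standalone illustration only: the class name, method names, and rate constants are made up for this example and are not the actual Nutch 1.0 API.]

```java
// Hypothetical sketch of adaptive re-crawl scheduling, in the spirit of a
// pluggable fetch schedule. All names and constants here are illustrative.
public class AdaptiveScheduleSketch {
    static final long MIN_INTERVAL = 60L * 60;           // 1 hour, in seconds
    static final long MAX_INTERVAL = 30L * 24 * 60 * 60; // 30 days
    static final double INC_RATE = 0.4; // back off when the page is unchanged
    static final double DEC_RATE = 0.2; // fetch sooner when the page changed

    /**
     * Compute the next fetch interval (seconds) from the previous interval
     * and whether the page content changed since the last fetch.
     */
    static long nextInterval(long prevInterval, boolean changed) {
        double next = changed
                ? prevInterval * (1.0 - DEC_RATE)  // page changed: shrink interval
                : prevInterval * (1.0 + INC_RATE); // page unchanged: grow interval
        // Clamp so we neither hammer a host nor forget a page entirely.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, (long) next));
    }

    public static void main(String[] args) {
        long interval = 24L * 60 * 60;            // start with one day
        interval = nextInterval(interval, false); // unchanged -> back off
        System.out.println(interval);             // 120960 (1.4 days)
        interval = nextInterval(interval, true);  // changed -> fetch sooner
        System.out.println(interval);             // 96768
    }
}
```

The appeal of making this pluggable is that a vertical-search deployment (audience #2 above) can swap in a schedule tuned to its sites without touching the rest of the crawl loop.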