Re: The Future of Nutch, reactivated

AJ Chen Thu, 14 May 2009 11:41:18 -0700

Andrzej, great summary. I played with Nutch before for a web search engine,
but have not used it for a while because it has become too complicated. Based
on my experience building a semantic search engine for the healthcare vertical,
I think it would be beneficial to separate crawling from search
architecturally and focus Nutch on just crawling.


My sense is that if Nutch can make crawling simple and deliver high-quality
crawled content along with important metadata like link structure, it will
have a much better chance of becoming an indispensable part of a search
engine. Of course, it's important to include an implementation for search as
well so that Nutch can provide end-to-end (i.e. crawl and search) results for
evaluation. But don't get stuck on search, because there are a variety of
different search needs, such as static search, dynamic search, real-time
search, semantic search, etc. It's not easy to make Nutch meet all of
these real-world needs. Rather, Nutch should provide the crawled content in
a way that lets people easily apply different search tools or search
technologies.

As for the audience, it makes sense to focus on the middle of the usage
spectrum, i.e. vertical search or focused search at mid-range scale. But I
wouldn't ignore small projects or developer projects, because these are often
the starting point for new project evaluation.

-aj
-- 
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA

On Thu, May 14, 2009 at 6:45 AM, Andrzej Bialecki <a...@getopt.org> wrote:

> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before; I'm trying here to express this as it looks from my point of
> view.
>
> Target audience
> ===============
> I think that the Nutch project is experiencing a crisis of personality now -
> we are not sure what the target audience is, and we cannot satisfy everyone.
> I think that there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention are
> the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall into
> this category. Modularity, flexibility in implementing custom processing,
> ability to modify workflows and to use only some Nutch components seem to be
> chief concerns here. Scalability too, but only up to a volume of ~100-200
> million documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number of
> Nutch users that fall into this category, for historical reasons. Link-based
> ranking and resource discovery are not that important here, but integration
> with Windows networking, Microsoft formats and databases, as well as
> realtime indexing and easy index maintenance are crucial. This class of
> users often has to heavily customize Nutch to get any sensible result. Also,
> this is where Solr really shines, so there is little benefit in using Nutch
> here. I predict that Nutch will have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the accent is on
> the ease of use out of the box, and an often requested feature is a GUI
> frontend. Currently IMHO Nutch is too complex and requires too much
> command-line operation for casual users to make this use case attractive.
>
> What is the target audience that we as a community want to support? By this
> I mean not only the moral support, but also active participation in the
> development process. From the place where we are at the moment we could go
> in any of the above directions.
>
> Core competence
> ===============
> This is a simple but important point. Currently we maintain several major
> subsystems in Nutch that are implemented by other projects, and often in a
> better way. Plugin framework (and dependency injection) and content parsing
> are two areas that we have to delegate to third-party libraries, such as
> Tika and OSGI or some other simple IOC container - probably there are other
> components that we don't have to do ourselves. Another thing that I'd love
> to delegate is the distributed search and index maintenance - either through
> Solr or Katta or something else.
>
> The question then is, what is the core competence of this project? I see
> the following major areas that are unique to Nutch:
>
> * crawling - this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and netiquette
> constraints, etc.
>
> * web graph analysis - this includes link-based ranking, mirror detection
> (and URL "aliasing") but also link spam detection and a more complex control
> over the crawling frontier.
>
> Anything more? I'm not sure - perhaps I would add template detection and
> pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and pluggable FetchSchedule and Signature. A lot remains to
> be done here, and we are still spending a lot of resources on dealing with
> issues outside this core competence.
>
> -------
>
> So, what do we need to do next?
>
> * we need to decide where we should commit our resources, as a community of
> users, contributors and committers, so that the project is most useful to
> our target audience. At this point there are few active committers, so I
> don't think we can cover more than 1 direction at a time ... ;)
>
> * we need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
>
> Feel free to comment on the above, make suggestions or corrections. I'd
> like to wrap it up in a concise mission statement that would help us set the
> goals for the next couple months.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
