Keep it simple.
Many people, it seems to me, use Nutch to exercise their programming expertise and talents in some way. I am just a user, and I think that users simply want something that can index the web and return results when they search. I don't want to deal with complicated application names; I just want to crawl and search. And it should be noted that for most users, myself included, getting Nutch working on Linux or Windows is not a trivial job. Anyway, I think that the biggest use for Nutch will be for vertical or regional search. From this point of view, by the way, I really didn't like my experience with the original 1.0 release: the crawling phase was too slow.


----- Original Message ----- From: "Andrzej Bialecki" <a...@getopt.org>
To: <nutch-user@lucene.apache.org>
Sent: Thursday, May 14, 2009 10:45 AM
Subject: The Future of Nutch, reactivated


Hi all,

I'd like to revive this thread and gather additional feedback, so that we
end up with concrete conclusions. Much of what I write below others have
said before; I'm trying here to express it as it looks from my point
of view.

Target audience
===============
I think that the Nutch project is experiencing an identity crisis at the
moment - we are not sure what our target audience is, and we cannot
satisfy everyone. I see the following groups of Nutch users:

1. Large-scale Internet crawl & search: in practice there are only a few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manageability, and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, and the ability to modify workflows and to use only some
Nutch components seem to be the chief concerns here. Scalability matters
too, but only up to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats, and
databases, as well as real-time indexing and easy index maintenance,
are crucial.
This class of users often has to heavily customize Nutch to get any
sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent
here is on ease of use out of the box, and a frequently requested
feature is a GUI frontend. Currently, IMHO, Nutch is too complex and
requires too much command-line operation for this use case to be
attractive to casual users.

What is the target audience that we as a community want to support? By
this I mean not only moral support, but also active participation in
the development process. From where we stand at the moment, we could go
in any of the above directions.

Core competence
===============
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are also implemented by other projects,
often in a better way. The plugin framework (and dependency injection)
and content parsing are two areas that we ought to delegate to
third-party libraries, such as OSGi (or some other simple IoC container)
and Tika - and there are probably other components that we don't have to
build ourselves. Another thing I'd love to delegate is distributed
search and index maintenance - whether through Solr, Katta, or
something else.
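
To make the parsing part concrete, here is a rough, untested sketch of
what delegating content parsing to Tika could look like - its
AutoDetectParser already does the content-type dispatch that our own
parse plugins reimplement. Only an illustration, not a patch:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaParseExample {
      public static void main(String[] args) throws Exception {
        byte[] raw = "<html><body>Hello Nutch</body></html>".getBytes("UTF-8");

        AutoDetectParser parser = new AutoDetectParser();     // detects the MIME type itself
        BodyContentHandler handler = new BodyContentHandler(); // collects extracted plain text
        Metadata metadata = new Metadata();                   // filled with content type, title, etc.

        InputStream in = new ByteArrayInputStream(raw);
        try {
          parser.parse(in, handler, metadata, new ParseContext());
        } finally {
          in.close();
        }

        System.out.println("text: " + handler.toString());
        System.out.println("type: " + metadata.get("Content-Type"));
      }
    }

One parser object, any document format Tika knows about - that is a lot
of plugin code we would no longer have to maintain ourselves.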

The question then is, what is the core competence of this project? I see
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling),
discovery and classification of new resources, strategies for crawling
specific sets of URLs (hosts and domains) under bandwidth and netiquette
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror
detection (and URL "aliasing"), but also link spam detection and
finer-grained control over the crawl frontier.
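
To give a flavor of what I mean by link-based ranking as a core
competence, here is a toy power-iteration PageRank over an in-memory
adjacency list. Nothing Nutch-specific, and a real implementation would
of course run as MapReduce jobs over the whole web graph:

    import java.util.*;

    public class TinyPageRank {
      public static void main(String[] args) {
        // adjacency list: page -> pages it links to (toy 3-node graph)
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));

        double d = 0.85;  // damping factor
        int n = links.size();
        Map<String, Double> rank = new HashMap<String, Double>();
        for (String page : links.keySet()) rank.put(page, 1.0 / n);

        for (int iter = 0; iter < 20; iter++) {  // fixed iteration count, for simplicity
          Map<String, Double> next = new HashMap<String, Double>();
          for (String page : links.keySet()) next.put(page, (1 - d) / n);
          // each page distributes its rank evenly over its outlinks
          for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue())
              next.put(target, next.get(target) + d * share);
          }
          rank = next;
        }
        System.out.println(rank);  // "c" ends up highest in this toy graph
      }
    }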

Anything more? I'm not sure - perhaps I would add template detection and
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link
analysis package and pluggable FetchSchedule and Signature. A lot
remains to be done here, and we are still spending a lot of resources on
dealing with issues outside this core competence.
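
For those who haven't looked at that code: the idea behind an adaptive
schedule is simply to shorten the re-fetch interval for pages that
change and lengthen it for pages that don't. A self-contained sketch of
that logic follows - the class and method names are made up for
illustration, not the actual FetchSchedule API:

    /**
     * Toy adaptive re-fetch scheduler: shrink the interval when a page
     * changed since the last fetch, grow it when it didn't. Illustrates
     * the idea behind a pluggable FetchSchedule; all names are invented.
     */
    public class AdaptiveScheduleSketch {
      private static final long MIN_INTERVAL = 60L * 60;            // 1 hour, in seconds
      private static final long MAX_INTERVAL = 90L * 24 * 60 * 60;  // 90 days
      private static final double INC_RATE = 0.4;  // grow by 40% if unchanged
      private static final double DEC_RATE = 0.2;  // shrink by 20% if changed

      /** Returns the next re-fetch interval in seconds. */
      public long nextInterval(long currentInterval, boolean pageChanged) {
        double next = pageChanged
            ? currentInterval * (1.0 - DEC_RATE)
            : currentInterval * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, (long) next));
      }

      public static void main(String[] args) {
        AdaptiveScheduleSketch s = new AdaptiveScheduleSketch();
        long interval = 24L * 60 * 60;              // start at one day
        interval = s.nextInterval(interval, true);  // page changed -> ~19.2 hours
        interval = s.nextInterval(interval, false); // unchanged    -> ~26.9 hours
        System.out.println("next interval (s): " + interval);
      }
    }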

-------

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community
of users, contributors and committers, so that the project is most
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than one direction at a
time ... ;)

* we need to re-architect Nutch to focus on our core competence, and
delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd
like to wrap it up in a concise mission statement that would help us set
the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



