Re: The Future of Nutch, reactivated

Julien Nioche Sat, 23 May 2009 03:47:03 -0700

Hi,

I am joining the conversation a bit late, but never mind...


In my view the main target should be (2). As you pointed out, Solr covers
(3) and (4) quite well (or will progressively do so). As for (1), there is
definitely an audience, even if it is small, and it would certainly benefit
from the work done towards (2). As you said, operating on a large scale (i.e.
using more than 100 slaves) requires a lot of resources and a dedicated team,
and I expect that the people interested in large scale would have their own
views on scoring and spam prevention anyway :-)

I completely agree that there should be as much delegation of
functionality to third parties as possible (e.g. parsing with Tika) in
order to focus on the core competences.
I really like your idea of doing template detection for instance. Another
thing I found promising is the HBase integration (NUTCH-650), which would
also allow more interoperability with other tools such as Heritrix and make
the data structure a bit more open.

Talking about future functionalities, we do quite a lot of text analysis
with tools like Gate or UIMA and have been working on things such as
detection of adult content and automatic text classification with Nutch.
There are plenty of interesting things that can be done for vertical search
systems, such as Named Entity Extraction. Since NLP applications can
be quite greedy, leveraging Hadoop is definitely an advantage. In future
versions of Nutch I'd love to see a separation between format parsing (i.e.
Tika) and content analysis, where implementations would get a
semi-structured representation of the documents, a bit like what extensions
of HTML parsers are getting currently, but regardless of the original
format.
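
To make the parsing/analysis split more concrete, here is a rough sketch. It
is written in Python only for brevity, and every name in it (Section,
ParsedDocument, ContentAnalyzer, KeywordClassifier, run_analyzers) is
hypothetical: none of them exist in Nutch or Tika. The parsing step is
assumed to have already produced the format-independent structure.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Section:
    """One structural unit of a parsed document (hypothetical type)."""
    kind: str  # e.g. "title", "paragraph", "link"
    text: str


@dataclass
class ParsedDocument:
    """Format-independent output of the parsing step (hypothetical type)."""
    url: str
    sections: list[Section]
    metadata: dict[str, str] = field(default_factory=dict)


class ContentAnalyzer(Protocol):
    """Hypothetical extension point: it only sees the parsed representation."""

    def analyze(self, doc: ParsedDocument) -> None: ...


class KeywordClassifier:
    """Toy analyzer that labels a document when any flagged word appears."""

    def __init__(self, label: str, words: set[str]) -> None:
        self.label = label
        self.words = {w.lower() for w in words}

    def analyze(self, doc: ParsedDocument) -> None:
        tokens = {t.lower() for s in doc.sections for t in s.text.split()}
        if tokens & self.words:
            doc.metadata["category"] = self.label


def run_analyzers(doc: ParsedDocument,
                  analyzers: list[ContentAnalyzer]) -> ParsedDocument:
    """Apply each analyzer in order; none of them knows the source format."""
    for analyzer in analyzers:
        analyzer.analyze(doc)
    return doc
```

The point of the sketch is that an analyzer written once against
ParsedDocument would work the same whether the original was HTML, PDF or
a Word file.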

Have a good weekend

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/14 Andrzej Bialecki <a...@getopt.org>

> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before, I'm trying here to express this as it looks from my point of
> view.
>
> Target audience
> ===============
> I think that the Nutch project is experiencing a crisis of personality now -
> we are not sure what the target audience is, and we cannot satisfy everyone.
> I think that there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention are
> the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall into
> this category. Modularity, flexibility in implementing custom processing,
> ability to modify workflows and to use only some Nutch components seem to be
> chief concerns here. Scalability too, but only up to a volume of ~100-200
> mln documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number of
> Nutch users that fall into this category, for historical reasons. Link-based
> ranking and resource discovery are not that important here, but integration
> with Windows networking, Microsoft formats and databases, as well as
> realtime indexing and easy index maintenance are crucial. This class of
> users often has to heavily customize Nutch to get any sensible result. Also,
> this is where Solr really shines, so there is little benefit in using Nutch
> here. I predict that Nutch will have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the accent is on
> the ease of use out of the box, and an often requested feature is a GUI
> frontend. Currently IMHO Nutch is too complex and requires too much
> command-line operation for casual users to make this use case attractive.
>
> What is the target audience that we as a community want to support? By this
> I mean not only the moral support, but also active participation in the
> development process. From the place where we are at the moment we could go
> in any of the above directions.
>
> Core competence
> ===============
> This is a simple but important point. Currently we maintain several major
> subsystems in Nutch that are implemented by other projects, and often in a
> better way. Plugin framework (and dependency injection) and content parsing
> are two areas that we have to delegate to third-party libraries, such as
> Tika and OSGI or some other simple IOC container - probably there are other
> components that we don't have to do ourselves. Another thing that I'd love
> to delegate is the distributed search and index maintenance - either through
> Solr or Katta or something else.
>
> The question then is, what is the core competence of this project? I see
> the following major areas that are unique to Nutch:
>
> * crawling - this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and netiquette
> constraints, etc.
>
> * web graph analysis - this includes link-based ranking, mirror detection
> (and URL "aliasing") but also link spam detection and a more complex control
> over the crawling frontier.
>
> Anything more? I'm not sure - perhaps I would add template detection and
> pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and pluggable FetchSchedule and Signature. A lot remains to
> be done here, and we are still spending a lot of resources on dealing with
> issues outside this core competence.
>
> -------
>
> So, what do we need to do next?
>
> * we need to decide where we should commit our resources, as a community of
> users, contributors and committers, so that the project is most useful to
> our target audience. At this point there are few active committers, so I
> don't think we can cover more than 1 direction at a time ... ;)
>
> * we need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
>
> Feel free to comment on the above, make suggestions or corrections. I'd
> like to wrap it up in a concise mission statement that would help us set the
> goals for the next couple months.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
