Hi all,
ApacheCon is over and our 1.0 release has been out for some time, so I
think it's a good moment to discuss the next steps in Nutch
development.
Let me share the topics I identified and presented in my ApacheCon
slides, along with some others worth discussing based on various
conversations I had there and on the discussions we've had on our
mailing list:
1. Avoid duplication of effort
------------------------------
Currently we spend significant effort implementing functionality that
other projects specialize in. Instead of duplicating that work, and
sometimes doing it poorly, we should concentrate on delegating and
reusing:
* Use Tika for content parsing: this will require some effort and
collaboration with the Tika project, to improve Tika's ability to handle
more complex formats well (e.g. hierarchical compound documents such as
archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
parse-swf).
* Use Solr for indexing & search: it is hard to justify the effort of
developing and maintaining our own search server - Solr offers much more
functionality, configurability, performance and ease of integration than
our relatively primitive one. Our Solr integration needs to be improved
so that it's easier to set up and operate.
* Use a database-like storage abstraction: this may seem like a serious
departure from the current architecture, but I don't mean that we should
switch to an SQL DB ... what I mean is that we should provide an option
to use HBase, as well as the current plain MapFile-s (and perhaps other
types of DBs, such as Berkeley DB or SQL, if that makes sense), as our
storage. There is a very promising initial port of Nutch to HBase; it is
currently tightly coupled to the HBase API (which is both good and bad),
but it provides several improvements over our current storage, so I
think it's worth using as the new default - though let's see if we can
make it more abstract.
* Plugins: the initial OSGi port looks good, but I'm not sure yet
whether the benefits of OSGi outweigh the cost of this change ...
* Shard management: this is currently an Achilles' heel of Nutch, where
users are left on their own ... If we switch to HBase then, at least on
the crawling side, shard management will become much easier. This still
leaves the problem of deploying new content to the search server(s). The
candidate framework for this side of shard management is Katta plus the
patches provided by Ted Dunning (see ???). If we switch to Solr we would
also have to use the Katta / Solr integration, and perhaps the
Solr/Hadoop integration as well. This is a complex mix of half-ready
components that needs to be well thought through ...
* Crawler Commons: during our Crawler MeetUp all representatives agreed
that we should collect the few components that are nearly the same
across all projects, collaborate on their development, and use them as
an external dependency. The candidate components are:
- robots.txt parsing
- URL filtering and normalization
- page signature (fingerprint) implementations
- page template detection & removal (aka. main content extraction)
- possibly others, like URL redirection tracking, PageRank
calculation, crawler trap detection etc.
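To give a feel for the kind of small, self-contained component that
could live in such a shared library, here is a minimal URL
normalization sketch in Java. The class name and the exact set of rules
are mine for illustration - this is not an existing Crawler Commons or
Nutch API:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Minimal URL normalizer sketch; class name and rules are illustrative. */
public class UrlNormalizer {

    /**
     * Normalizes an absolute http/https URL; returns the input unchanged
     * if it cannot be parsed as one.
     */
    public static String normalize(String url) {
        try {
            URI u = new URI(url.trim()).normalize(); // resolves "." and ".." segments
            if (u.getScheme() == null || u.getHost() == null) {
                return url; // relative URL - leave as-is
            }
            String scheme = u.getScheme().toLowerCase();
            String host = u.getHost().toLowerCase();
            int port = u.getPort();
            // Drop default ports: 80 for http, 443 for https.
            if ((port == 80 && scheme.equals("http"))
                    || (port == 443 && scheme.equals("https"))) {
                port = -1;
            }
            // An empty path and "/" name the same resource.
            String path = (u.getPath() == null || u.getPath().isEmpty())
                    ? "/" : u.getPath();
            // Rebuild without the fragment, which is never sent to the server.
            return new URI(scheme, null, host, port, path, u.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            return url; // unparseable URL - leave as-is
        }
    }
}
```

Collapsing case, default ports, dot-segments and fragments like this is
exactly the kind of behavior every crawler re-implements slightly
differently today.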
2. Make Nutch easier to use
---------------------------
This, as you may remember from our earlier discussions, raises the
question: who is the target audience of Nutch?
In my opinion, the main users of Nutch are vertical search engines, and
this is the audience that we should cater to. There are many reasons for
this:
- Nutch is too complex and too heavy for those who need to crawl up to
a few thousand pages. Now that the Droids project exists, it's probably
not worth the effort to attempt a complete re-design of Nutch to fit
the needs of this group - Nutch is based on map-reduce, and it's not
likely we would want to change that, so there will always be
significant overhead for small jobs. I'm not saying we shouldn't make
Nutch easier to use, but for small crawls Nutch is overkill. Also, in
many cases these users don't realize that they don't do any frontier
discovery and expansion, and that what they really need is Solr.
- at the other end of the spectrum, there are very few companies that
want to do large, web-scale crawling - this is costly, and requires a
solid business plan and serious funding. These users are prepared to
spend significant effort on customization and problem-solving anyway,
or they may want to use only some parts of Nutch. Often they are also
not too eager to contribute back to the project - either because of
their proprietary nature or because their customizations are not useful
for a general audience.
The remaining group is interested in medium-size, high-quality crawling
(focused, with good spam & junk controls) - which is either enterprise
search or vertical search. We should make Nutch an attractive platform
for such users, and we should discuss what this entails. Also, if we
refactor Nutch in the way I described above, it will be easier for such
users to contribute back to Nutch and other related projects.
3. Provide a platform for solving the really interesting issues
---------------------------------------------------------------
Nutch has many bits and pieces that implement really smart algorithms
and heuristics to solve difficult issues that occur in crawling. The
problem is that they are often well hidden and poorly documented, and
their interaction with the rest of the system is far from obvious.
Sometimes this is due to premature performance optimization, in other
cases just poorly abstracted design. Examples include OPIC scoring,
meta-tag & metadata handling, deduplication, redirection handling, etc.
Even though these components are usually implemented as plugins, this
lack of transparency and poor design make it difficult to experiment
with Nutch. I believe that improving this area will result in many more
users contributing back to the project, both from business and from
academia.
And there are quite a few interesting challenges to solve:
* crawl scheduling, i.e. determining the order and composition of
fetchlists to maximize the crawling speed.
* spam & junk detection (I won't go into detail here - there is plenty
of literature on the subject)
* crawler trap handling (e.g. the classic calendar page that generates
an infinite number of pages).
* enterprise-specific ranking and scoring. This includes users' feedback
(explicit and implicit, e.g. click-throughs)
* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)
* near-duplicate detection, and the closely related issue of extracting
the main content from a templated page.
* URL aliasing (e.g. www.a.com == a.com == a.com/index.html ==
a.com/default.asp), and what happens with inlinks to such aliased pages.
Also related to this is the problem of temporary/permanent redirects and
complete mirrors.
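To make the crawler-trap item above concrete: a cheap first line of
defense is a purely syntactic heuristic on the URL path, flagging
suspiciously deep or self-repeating paths before we ever fetch them.
This is just a sketch - the class name and thresholds are mine, not
existing Nutch code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a syntactic crawler-trap heuristic; thresholds are guesses. */
public class TrapHeuristics {
    static final int MAX_DEPTH = 20;   // max path segments before we get suspicious
    static final int MAX_REPEATS = 3;  // max occurrences of the same segment

    /** Returns true if the URL path looks like a trap. */
    public static boolean looksLikeTrap(String path) {
        String[] segments = path.split("/");
        if (segments.length > MAX_DEPTH) {
            return true; // suspiciously deep, e.g. an ever-growing calendar
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String s : segments) {
            if (s.isEmpty()) continue;
            int n = counts.merge(s, 1, Integer::sum);
            if (n > MAX_REPEATS) {
                return true; // cycling path like /a/b/a/b/a/b/a/b/...
            }
        }
        return false;
    }
}
```

In a real crawler this would only be one signal among several (per-host
page budgets, content similarity across fetched pages, etc.), since
legitimate sites can trip any single heuristic.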
Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an
attractive platform to develop and experiment with such components.
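As one example of the near-duplicate challenge above, a common family
of techniques is shingling: hash overlapping runs of tokens, keep a
small sketch of the smallest hashes as the page signature, and compare
sketch overlap between pages. The toy sketch below shows the idea -
names, hash mixing and thresholds are illustrative, and this is not
Nutch's actual Signature implementation:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy shingle-based page signature for near-duplicate detection. */
public class ShingleSignature {
    static final int SHINGLE = 4; // tokens per shingle
    static final int KEEP = 8;    // smallest hashes kept as the sketch

    /** Hashes every run of SHINGLE tokens and keeps the KEEP smallest hashes. */
    public static long[] signature(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        TreeSet<Long> smallest = new TreeSet<>();
        for (int i = 0; i + SHINGLE <= tokens.length; i++) {
            long h = 1125899906842597L; // arbitrary seed
            for (int j = i; j < i + SHINGLE; j++) {
                h = 31 * h + tokens[j].hashCode(); // cheap hash combine
            }
            smallest.add(h);
            if (smallest.size() > KEEP) {
                smallest.pollLast(); // drop the largest, keep the KEEP smallest
            }
        }
        long[] sig = new long[smallest.size()];
        int k = 0;
        for (long x : smallest) sig[k++] = x;
        return sig;
    }

    /** Fraction of shared hashes; near 1.0 means near-duplicate pages. */
    public static double overlap(long[] a, long[] b) {
        if (a.length == 0 || b.length == 0) return 0.0;
        Set<Long> sa = new HashSet<>();
        for (long x : a) sa.add(x);
        int common = 0;
        for (long x : b) if (sa.contains(x)) common++;
        return (double) common / Math.max(a.length, b.length);
    }
}
```

Because the sketch keeps only the smallest hashes, two pages that share
most of their text tend to keep the same hashes and score close to 1.0,
regardless of page length.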
-----------------
Briefly ;) that's what comes to my mind when I think about the future of
Nutch. I invite you all to share your thoughts and suggestions!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com