Hi all,

ApacheCon is over, our 1.0 release has been out for some time, so I think it's a good moment to discuss what the next steps in Nutch development should be.

Let me share the topics I identified and presented in my ApacheCon slides, plus some other topics worth discussing that came out of various conversations I had there and of the discussions on our mailing list:

1. Avoid duplication of effort
------------------------------
We currently spend significant effort implementing functionality that other projects are dedicated to. Instead of duplicating that work, and sometimes doing it poorly, we should concentrate on delegating and reusing:

* Use Tika for content parsing: this will require some effort and collaboration with the Tika project, to improve Tika's ability to handle more complex formats well (e.g. hierarchical compound documents such as archives, mailboxes, RSS), and to contribute any missing parsers (e.g. parse-swf).
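
To give a feel for what delegating to Tika buys us, here's a minimal sketch of parsing arbitrary content through Tika's AutoDetectParser (the three-argument parse() call is the Tika 0.x API; the resource name is just a placeholder):

  import java.io.InputStream;

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.Parser;
  import org.apache.tika.sax.BodyContentHandler;

  public class TikaParseSketch {
    public static void main(String[] args) throws Exception {
      // AutoDetectParser sniffs the MIME type and dispatches to the right parser.
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      // "sample.pdf" is a made-up resource for this sketch.
      InputStream stream = TikaParseSketch.class.getResourceAsStream("/sample.pdf");
      try {
        parser.parse(stream, handler, metadata);
      } finally {
        stream.close();
      }
      System.out.println("title: " + metadata.get(Metadata.TITLE));
      System.out.println(handler.toString());  // the extracted plain text
    }
  }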

* Use Solr for indexing & search: it is hard to justify the effort of developing and maintaining our own search server - Solr offers much more functionality, configurability, performance and ease of integration than our relatively primitive search server. Our integration with Solr needs to be improved so that it's easier to set up and operate.
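
As an illustration of how little client code Solr indexing requires, here's a minimal SolrJ sketch (the field names and the server URL are assumptions for the example, not our actual schema):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SolrIndexSketch {
    public static void main(String[] args) throws Exception {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/");    // the URL as the unique key
      doc.addField("title", "Example page");
      doc.addField("content", "parsed page text goes here");
      solr.add(doc);
      solr.commit();  // make the document visible to searchers
    }
  }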

* Use a database-like storage abstraction: this may seem like a serious departure from the current architecture, but I don't mean that we should switch to an SQL DB ... what I mean is that we should provide the option to use HBase, as well as the current plain MapFile-s (and perhaps other types of DBs, such as Berkeley DB or SQL, if that makes sense) as our storage. There is a very promising initial port of Nutch to HBase; it is currently tightly coupled to the HBase API (which is both good and bad), and it provides several improvements over our current storage, so I think it's worth adopting as the new default - but let's see if we can make it more abstract.
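
To make "more abstract" concrete, I imagine something along these lines - a deliberately tiny, hypothetical interface (the names are illustrative only, not an agreed API) that a MapFile-backed and an HBase-backed implementation could both satisfy:

  import java.io.Closeable;
  import java.io.IOException;

  // Hypothetical sketch, not an agreed API: the crawl DB as a key/value store.
  public interface WebPageStore extends Closeable {

    // Return the stored record for a URL, or null if we've never seen it.
    byte[] get(String url) throws IOException;

    // Create or update the record for a URL.
    void put(String url, byte[] record) throws IOException;

    // Possible implementations: one wrapping the current MapFile-s,
    // one wrapping HBase's HTable get()/put(), and later perhaps
    // Berkeley DB or an SQL back-end.
  }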

* Plugins: the initial OSGi port looks good, but I'm not yet sure whether the benefits of OSGi outweigh the cost of this change ...
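
For those who haven't looked at the port: under OSGi each plugin becomes a bundle that registers its services with the framework, roughly like this hypothetical activator (Runnable is just a stand-in for whatever extension-point interface we'd actually expose):

  import org.osgi.framework.BundleActivator;
  import org.osgi.framework.BundleContext;

  // Hypothetical sketch of a Nutch plugin as an OSGi bundle.
  public class ParsePluginActivator implements BundleActivator {

    public void start(BundleContext context) throws Exception {
      // Publish the plugin's service so other bundles can look it up;
      // Runnable stands in for a real extension-point interface.
      context.registerService(Runnable.class.getName(), new Runnable() {
        public void run() { /* plugin work would go here */ }
      }, null);
    }

    public void stop(BundleContext context) throws Exception {
      // Services registered through this context are unregistered for us.
    }
  }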

* Shard management: this is currently an Achilles' heel of Nutch, where users are left on their own ... If we switch to HBase, then at least on the crawling side shard management will become much easier. This still leaves the problem of deploying new content to the search server(s). The candidate framework for this side of shard management is Katta plus the patches provided by Ted Dunning (see ???). If we switch to Solr we would also have to use the Katta / Solr integration, and perhaps the Solr / Hadoop integration as well. This is a complex mix of half-ready components that needs to be well thought through ...

* Crawler Commons: during our Crawler MeetUp all representatives agreed that we should collect a few components that are nearly the same across all projects, collaborate on their development, and use them as an external dependency. The candidate components are:

 - robots.txt parsing
 - URL filtering and normalization
 - page signature (fingerprint) implementations (see the sketch after this list)
 - page template detection & removal (aka main content extraction)
 - possibly others, like URL redirection tracking, PageRank calculation, crawler trap detection, etc.
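
As an example of how small some of these shared components would be, here's a hedged sketch of the simplest possible page signature - a plain MD5 over the raw content (Nutch's TextProfileSignature is smarter and hashes a normalized text profile instead):

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  // Minimal sketch: identical bytes produce identical signatures.
  public class Md5PageSignature {

    public static byte[] calculate(byte[] content) {
      try {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return md5.digest(content);
      } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);  // MD5 is guaranteed by the JDK
      }
    }
  }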

2. Make Nutch easier to use
---------------------------
This, as you may remember from our earlier discussions, raises the question: who is the target audience of Nutch?

In my opinion, the main users of Nutch are vertical search engines, and this is the audience that we should cater to. There are many reasons for this:

- Nutch is too complex and too heavy for those who need to crawl up to a few thousand pages. Now that the Droids project exists, it's probably not worth the effort to attempt a complete re-design of Nutch to fit the needs of this group - Nutch is based on map-reduce, and it's not likely we would want to change that, so there will always be significant overhead for small jobs. I'm not saying we shouldn't make Nutch easier to use, but for small crawls Nutch is overkill. Also, in many cases these users don't realize that they do no frontier discovery and expansion at all, and that what they really need is Solr.

- at the other end of the spectrum, there are very few companies that want to do wide, web-scale crawling - this is costly, and requires a solid business plan and serious funding. These users are prepared to spend significant effort on customizations and problem-solving anyway, or they may want to use only some parts of Nutch. Often they are also not too eager to contribute back to the project - either because of their proprietary nature or because their customizations are not useful to a general audience.

The remaining group is interested in medium-size, high-quality crawling (focused, with good spam & junk controls) - which means either enterprise search or vertical search. We should make Nutch an attractive platform for such users, and we should discuss what this entails. Also, if we refactor Nutch the way I described above, it will be easier for such users to contribute back to Nutch and other related projects.

3. Provide a platform for solving the really interesting issues
---------------------------------------------------------------
Nutch has many bits and pieces that implement really smart algorithms and heuristics to solve difficult issues that occur in crawling. The problem is that they are often well hidden and poorly documented, and their interaction with the rest of the system is far from obvious. Sometimes this is due to premature performance optimization, in other cases it's simply poorly abstracted design. Examples include the OPIC scoring, meta-tags & metadata handling, deduplication, redirection handling, etc.

Even though these components are usually implemented as plugins, this lack of transparency and the poor design make it difficult to experiment with Nutch. I believe that improving this area will result in many more users contributing back to the project, both from business and from academia.

And there are quite a few interesting challenges to solve:

* crawl scheduling, i.e. determining the order and composition of fetchlists to maximize crawling speed.

* spam & junk detection (I won't go into details on this, there are tons of literature on the subject)

* crawler trap handling (e.g. the classic calendar page that generates an infinite number of pages).

* enterprise-specific ranking and scoring. This includes user feedback, both explicit and implicit (e.g. click-throughs).

* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)

* near-duplicate detection, and the closely related issue of extracting the main content from a templated page.

* URL aliasing (e.g. www.a.com == a.com == a.com/index.html == a.com/default.asp), and what happens with inlinks to such aliased pages. Also related to this is the problem of temporary/permanent redirects and complete mirrors. (A small normalizer sketch follows this list.)
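
To illustrate just the simplest aliasing rules, here's a hedged sketch of a normalizer that collapses a.com/index.html and a.com/default.asp onto a.com/ - our real normalizers (urlnormalizer-basic, urlnormalizer-regex) handle ports, query strings and much more:

  import java.net.MalformedURLException;
  import java.net.URL;

  // Hedged sketch: collapse a few trivial URL aliases onto one canonical form.
  // Deliberately ignores ports, query strings, etc.
  public class AliasNormalizer {

    public static String normalize(String urlString) throws MalformedURLException {
      URL url = new URL(urlString);
      String host = url.getHost().toLowerCase();  // host names are case-insensitive
      String path = url.getPath();
      if (path.length() == 0
          || path.equals("/index.html")
          || path.equals("/default.asp")) {
        path = "/";  // default pages alias the site root
      }
      return url.getProtocol() + "://" + host + path;
    }
  }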

Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an attractive platform to develop and experiment with such components.

-----------------
Briefly ;) that's what comes to my mind when I think about the future of Nutch. I invite you all to share your thoughts and suggestions!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
