Hi all,
ApacheCon is over and our 1.0 release has been out for some time, so I
think it's a good moment to discuss the next steps in Nutch
development.
Let me share the topics I identified and presented in my ApacheCon
slides, along with some others worth discussing based on various
conversations I had there and on the discussions we've had on our
mailing list:
1. Avoid duplication of effort
------------------------------
Currently we spend significant effort implementing functionality that
other projects specialize in. Instead of duplicating that work, and
sometimes doing it poorly, we should concentrate on delegating and
reusing:
* Use Tika for content parsing: this will require some effort and
collaboration with the Tika project, to improve Tika's ability to handle
more complex formats well (e.g. hierarchical compound documents such as
archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
parse-swf).
* Use Solr for indexing & search: it is hard to justify the effort of
developing and maintaining our own search server - Solr offers much more
functionality, configurability, performance and ease of integration than
our relatively primitive one. Our Solr integration needs to be improved
so that it's easier to set up and operate.
* Use a database-like storage abstraction: this may seem like a serious
departure from the current architecture, but I don't mean that we should
switch to an SQL DB ... what I mean is that we should provide an option
to use HBase, as well as the current plain MapFile-s (and perhaps other
types of DBs, such as Berkeley DB or SQL, if that makes sense), as our
storage. There is a very promising initial port of Nutch to HBase; it is
currently tightly coupled to the HBase API (which is both good and bad),
but it provides several improvements over our current storage, so I
think it's worth using as the new default - though let's see if we can
make it more abstract.
* Plugins: the initial OSGi port looks good, but I'm not sure yet
whether the benefits of OSGi outweigh the cost of this change ...
* Shard management: this is currently an Achilles' heel of Nutch, where
users are left on their own ... If we switch to HBase then, at least on
the crawling side, shard management will become much easier. This still
leaves the problem of deploying new content to the search server(s). The
candidate framework for this side of shard management is Katta plus the
patches provided by Ted Dunning (see ???). If we switch to Solr we would
also have to use the Katta / Solr integration, and perhaps the
Solr/Hadoop integration as well. This is a complex mix of half-ready
components that needs to be well thought through ...
* Crawler Commons: during our Crawler MeetUp all representatives agreed
that we should collect the few components that are nearly the same
across all projects, collaborate on their development, and use them as
an external dependency. The candidate components are:
- robots.txt parsing
- URL filtering and normalization
- page signature (fingerprint) implementations
- page template detection & removal (aka. main content extraction)
- possibly others, like URL redirection tracking, PageRank
calculation, crawler trap detection etc.
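To give a feel for the kind of small, self-contained component that
could live in such a shared library, here is a minimal URL
normalization sketch in Java. The class name and the exact set of rules
are mine for illustration - this is not an existing Crawler Commons or
Nutch API:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Minimal URL normalizer sketch; class name and rules are illustrative. */
public class UrlNormalizer {

    /**
     * Normalizes an absolute http/https URL; returns the input unchanged
     * if it cannot be parsed as one.
     */
    public static String normalize(String url) {
        try {
            URI u = new URI(url.trim()).normalize(); // resolves "." and ".." segments
            if (u.getScheme() == null || u.getHost() == null) {
                return url; // relative URL - leave as-is
            }
            String scheme = u.getScheme().toLowerCase();
            String host = u.getHost().toLowerCase();
            int port = u.getPort();
            // Drop default ports: 80 for http, 443 for https.
            if ((port == 80 && scheme.equals("http"))
                    || (port == 443 && scheme.equals("https"))) {
                port = -1;
            }
            // An empty path and "/" name the same resource.
            String path = (u.getPath() == null || u.getPath().isEmpty())
                    ? "/" : u.getPath();
            // Rebuild without the fragment, which is never sent to the server.
            return new URI(scheme, null, host, port, path, u.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            return url; // unparseable URL - leave as-is
        }
    }
}
```

Collapsing case, default ports, dot-segments and fragments like this is
exactly the kind of behavior every crawler re-implements slightly
differently today.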
2. Make Nutch easier to use
---------------------------
This, as you may remember from our earlier discussions, raises the
question: who is the target audience of Nutch?
In my opinion, the main users of Nutch are vertical search engines, and
this is the audience that we should cater to. There are many reasons for
this:
- Nutch is too complex and too heavy for those who need to crawl up to
a few thousand pages. Now that the Droids project exists, it's probably
not worth the effort to attempt a complete re-design of Nutch to fit
the needs of this group - Nutch is based on map-reduce, and it's not
likely we would want to change that, so there will always be
significant overhead for small jobs. I'm not saying we shouldn't make
Nutch easier to use, but for small crawls Nutch is overkill. Also, in
many cases these users don't realize that they don't do any frontier
discovery and expansion, and that what they really need is Solr.
- at the other end of the spectrum, there are very few companies that
want to do large, web-scale crawling - this is costly, and requires a
solid business plan and serious funding. These users are prepared to
spend significant effort on customization and problem-solving anyway,
or they may want to use only some parts of Nutch. Often they are also
not too eager to contribute back to the project - either because of
their proprietary nature or because their customizations are not useful
for a general audience.
The remaining group is interested in medium-size, high-quality crawling
(focused, with good spam & junk controls) - which is either enterprise
search or vertical search. We should make Nutch an attractive platform
for such users, and we should discuss what this entails. Also, if we
refactor Nutch in the way I described above, it will be easier for such
users to contribute back to Nutch and other related projects.
3. Provide a platform for solving the really interesting issues
---------------------------------------------------------------
Nutch has many bits and pieces that implement really smart algorithms
and heuristics to solve difficult issues that occur in crawling. The
problem is that they are often well hidden and poorly documented, and
their interaction with the rest of the system is far from obvious.
Sometimes this is due to premature performance optimization, in other
cases just poorly abstracted design. Examples include OPIC scoring,
meta-tag & metadata handling, deduplication, redirection handling, etc.
Even though these components are usually implemented as plugins, this
lack of transparency and poor design make it difficult to experiment
with Nutch. I believe that improving this area will result in many more
users contributing back to the project, both from business and from
academia.
And there are quite a few interesting challenges to solve:
* crawl scheduling, i.e. determining the order and composition of
fetchlists to maximize the crawling speed.
* spam & junk detection (I won't go into detail here - there is plenty
of literature on the subject)
* crawler trap handling (e.g. the classic calendar page that generates
an infinite number of pages).
* enterprise-specific ranking and scoring. This includes users' feedback
(explicit and implicit, e.g. click-throughs)
* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)
* near-duplicate detection, and the closely related issue of extracting
the main content from a templated page.
* URL aliasing (e.g. www.a.com == a.com == a.com/index.html ==
a.com/default.asp), and what happens with inlinks to such aliased pages.
Also related to this is the problem of temporary/permanent redirects and
complete mirrors.
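To make the crawler-trap item above concrete: a cheap first line of
defense is a purely syntactic heuristic on the URL path, flagging
suspiciously deep or self-repeating paths before we ever fetch them.
This is just a sketch - the class name and thresholds are mine, not
existing Nutch code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a syntactic crawler-trap heuristic; thresholds are guesses. */
public class TrapHeuristics {
    static final int MAX_DEPTH = 20;   // max path segments before we get suspicious
    static final int MAX_REPEATS = 3;  // max occurrences of the same segment

    /** Returns true if the URL path looks like a trap. */
    public static boolean looksLikeTrap(String path) {
        String[] segments = path.split("/");
        if (segments.length > MAX_DEPTH) {
            return true; // suspiciously deep, e.g. an ever-growing calendar
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String s : segments) {
            if (s.isEmpty()) continue;
            int n = counts.merge(s, 1, Integer::sum);
            if (n > MAX_REPEATS) {
                return true; // cycling path like /a/b/a/b/a/b/a/b/...
            }
        }
        return false;
    }
}
```

In a real crawler this would only be one signal among several (per-host
page budgets, content similarity across fetched pages, etc.), since
legitimate sites can trip any single heuristic.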
Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an
attractive platform to develop and experiment with such components.
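As one example of the near-duplicate challenge above, a common family
of techniques is shingling: hash overlapping runs of tokens, keep a
small sketch of the smallest hashes as the page signature, and compare
sketch overlap between pages. The toy sketch below shows the idea -
names, hash mixing and thresholds are illustrative, and this is not
Nutch's actual Signature implementation:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy shingle-based page signature for near-duplicate detection. */
public class ShingleSignature {
    static final int SHINGLE = 4; // tokens per shingle
    static final int KEEP = 8;    // smallest hashes kept as the sketch

    /** Hashes every run of SHINGLE tokens and keeps the KEEP smallest hashes. */
    public static long[] signature(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        TreeSet<Long> smallest = new TreeSet<>();
        for (int i = 0; i + SHINGLE <= tokens.length; i++) {
            long h = 1125899906842597L; // arbitrary seed
            for (int j = i; j < i + SHINGLE; j++) {
                h = 31 * h + tokens[j].hashCode(); // cheap hash combine
            }
            smallest.add(h);
            if (smallest.size() > KEEP) {
                smallest.pollLast(); // drop the largest, keep the KEEP smallest
            }
        }
        long[] sig = new long[smallest.size()];
        int k = 0;
        for (long x : smallest) sig[k++] = x;
        return sig;
    }

    /** Fraction of shared hashes; near 1.0 means near-duplicate pages. */
    public static double overlap(long[] a, long[] b) {
        if (a.length == 0 || b.length == 0) return 0.0;
        Set<Long> sa = new HashSet<>();
        for (long x : a) sa.add(x);
        int common = 0;
        for (long x : b) if (sa.contains(x)) common++;
        return (double) common / Math.max(a.length, b.length);
    }
}
```

Because the sketch keeps only the smallest hashes, two pages that share
most of their text tend to keep the same hashes and score close to 1.0,
regardless of page length.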
-----------------
Briefly ;) that's what comes to my mind when I think about the future of
Nutch. I invite you all to share your thoughts and suggestions!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com