On 2010-06-29 08:34, xiao yang wrote:
> Hi, guys.
> 
> The current solution is rebuilding the whole index. Is it possible to
> index incrementally, which means we only need to index the newly
> crawled web pages, so we can update the index more frequently.

A common setup for incremental indexing is to create one index per Nutch
segment (shard), so that you index only the latest segment and deploy
only the latest index.
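
As a toy sketch of that setup (the paths and names below are
illustrative, not actual Nutch APIs): Nutch names segment directories by
creation timestamp, so the newest segment sorts last and is the only one
you need to index in a given cycle.

```python
# Minimal sketch: pick the newest Nutch segment by its timestamp-based
# directory name. Only this segment gets indexed and deployed.
segments = ["20100601120000", "20100615093000", "20100629083000"]

# Timestamp names sort lexicographically in chronological order,
# so max() yields the most recently created segment.
latest = max(segments)
print(latest)
```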

> The problem is: newly crawled web pages contain anchors of old pages,
> so I have to update the old index too.
> Are there any better solutions?

This is a difficult question. Currently there is no good way to account
for new anchors to old pages without rebuilding the whole index. Whether
that's a serious shortcoming depends on the turn-around cycle of your
whole index (how often you reindex all data anyway because it was
updated) and on the number of anchors you have already collected (i.e.,
whether the anchors collected so far are "representative enough" that
the missing new anchors can be ignored).

Indexing will be redesigned in Nutch 2.0, but at the moment we don't
yet have a solution to this particular issue that would work
incrementally. Any ideas are welcome, but you need to keep in mind the
following:

* the link inversion step: new or updated anchor text needs to be
associated with target documents. This means that adding data from a
single source page may affect potentially thousands of target documents,
and the index data for those documents will have to be updated too...
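
To illustrate the fan-out (a toy sketch, not Nutch's actual LinkDb
code): inverting the outlinks of one newly crawled page touches every
target it links to, and each of those targets' index entries becomes
stale.

```python
# Toy link inversion: map each target URL to the anchor texts pointing
# at it. One new source page with N outlinks dirties N target documents.
new_page = {
    "url": "http://example.com/new",
    "outlinks": [("http://old.example.com/a", "old page A"),
                 ("http://old.example.com/b", "old page B")],
}

anchors_by_target = {}
for target, anchor in new_page["outlinks"]:
    anchors_by_target.setdefault(target, []).append(anchor)

# Every key here is an *old* document whose index entry is now stale.
dirty_docs = sorted(anchors_by_target)
print(dirty_docs)
```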

* currently Lucene / Solr don't offer incremental updates of field
content. This means that any update operation requires deleting the old
version of a document and re-adding the full document. This is much more
convenient to do with Solr than with the current Lucene-based Nutch
back-end.
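
Sketched below with a plain in-memory dict standing in for the index
(this is not the Solr or Lucene API, just the delete-and-re-add
semantics): even when only one field changes, the whole stored document
has to be replaced.

```python
# Stand-in for an index: doc id -> full document. Since per-field
# updates aren't available, an "update" is delete-old + add-full-new.
index = {"doc1": {"url": "http://old.example.com/a",
                  "content": "old page A",
                  "anchors": ["old page A"]}}

def update_anchors(index, doc_id, new_anchors):
    # We cannot patch just the 'anchors' field; we must rebuild and
    # re-add the complete document.
    full_doc = dict(index.pop(doc_id))      # "delete" the old version
    full_doc["anchors"] = sorted(set(full_doc["anchors"]) | set(new_anchors))
    index[doc_id] = full_doc                # re-add the full document

update_anchors(index, "doc1", ["fresh anchor"])
print(index["doc1"]["anchors"])
```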

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
