Re: incremental growing index

Andrzej Bialecki Thu, 12 Jul 2007 13:47:41 -0700

Mathijs Homminga wrote:

Hi everyone,
Our crawler generates and fetches segments continuously. We'd like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished.
The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.

The reason for wanting the linkdb is the anchor information. If you don't need any anchor information, you can provide an empty linkdb.

The reason why crawldb is needed is to get the current page status information (which may have changed in the meantime due to subsequent crawldb updates from newer segments). If you don't need this information, you can modify the Indexer.reduce() method (~line 212) to allow for this, and then remove the line in Indexer.index() that adds crawldb to the list of input paths.
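
As a rough illustration of the kind of change meant here, the sketch below models a reduce step in plain Python rather than in Nutch's Java. It is not the actual Indexer.reduce() code: the function name `reduce_record`, the source tags ("fetch", "parse", "crawldb", "linkdb") and the `require_crawldb` flag are all hypothetical. The only point is that the crawldb entry becomes optional instead of a reason to drop the record.

```python
def reduce_record(url, values, require_crawldb=True):
    """Combine all values grouped under one url into an index document.

    `values` is a list of (source, payload) pairs. The source tags and
    the `require_crawldb` flag are hypothetical, for illustration only.
    """
    by_source = {source: payload for source, payload in values}
    fetch = by_source.get("fetch")
    parse = by_source.get("parse")
    db_status = by_source.get("crawldb")

    # Without fetch and parse data from the segment there is nothing to index.
    if fetch is None or parse is None:
        return None
    # Strict behaviour: also drop the record when crawldb has no entry.
    if require_crawldb and db_status is None:
        return None

    doc = {"url": url, "content": parse, "anchors": by_source.get("linkdb", [])}
    if db_status is not None:
        doc["status"] = db_status  # current page status, when available
    return doc
```

With `require_crawldb=False`, a url that only has segment data still yields a document, just without the status field.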

The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and segment). This includes all data from the linkdb and crawldb that we actually don't need since we are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.
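
The cost pattern described above can be sketched with a small plain-Python simulation. It is not Nutch code: `index_segment` and the dictionary-based inputs are assumptions made for illustration. Every record from all three inputs is read and grouped by url, and only afterwards are the urls without segment data discarded, so the number of records read grows with the size of the linkdb and crawldb rather than with the size of the segment.

```python
from collections import defaultdict


def index_segment(linkdb, crawldb, segment):
    """Simulate a reduce-side join over three url-keyed inputs.

    Each argument is a dict mapping url -> payload (hypothetical
    stand-ins for the real on-disk data). Returns the joined documents
    and the number of input records that had to be read.
    """
    grouped = defaultdict(dict)
    records_read = 0

    # "Map" phase: every record of every input is read and emitted.
    inputs = (("linkdb", linkdb), ("crawldb", crawldb), ("segment", segment))
    for source, table in inputs:
        for url, payload in table.items():
            records_read += 1
            grouped[url][source] = payload

    # "Reduce" phase: keep only urls that actually occur in the segment.
    docs = {url: parts for url, parts in grouped.items() if "segment" in parts}
    return docs, records_read
```

Calling this with a crawldb of many urls and a segment of a few returns only the segment's urls in `docs`, while `records_read` counts every entry of all three inputs.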

Is this really a problem for you now? Unless your segments are tiny, the indexing process will be dominated by I/O from the processing of parseText / parseData and Lucene operations.

Any ideas on how to tackle this issue?
Is it feasible to look up the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?

It would probably be too slow, unless you made a copy of linkdb/crawldb on the local filesystem of each node. But at this point the benefit of this change would be doubtful, because of all the I/O you would need to do to prepare each task's environment ...
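
For comparison, the per-key lookup alternative asked about above could look roughly like the following plain-Python sketch. The `SortedLookup` class and `index_segment_by_lookup` function are hypothetical; they stand in for some sorted, key-indexed store and do not model disk seeks or network access, which is where the slowness mentioned above would come from. The sketch only shows that the number of lookups scales with the segment size instead of the database size.

```python
import bisect


class SortedLookup:
    """Hypothetical in-memory stand-in for a sorted, key-indexed store."""

    def __init__(self, table):
        self._keys = sorted(table)
        self._values = [table[k] for k in self._keys]
        self.lookups = 0  # counts how many point lookups were issued

    def get(self, key, default=None):
        self.lookups += 1
        i = bisect.bisect_left(self._keys, key)  # binary search on sorted keys
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default


def index_segment_by_lookup(segment, crawldb_lookup, linkdb_lookup):
    """Build one document per segment url, fetching db data per key."""
    docs = {}
    for url, text in segment.items():
        docs[url] = {
            "content": text,
            "status": crawldb_lookup.get(url),
            "anchors": linkdb_lookup.get(url, []),
        }
    return docs
```

Each segment url costs one lookup per database here; in a real cluster every such lookup may be a random read, so fewer records touched does not automatically mean less time spent.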



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
