Hi everyone,

Our crawler generates and fetches segments continuously. We'd like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished.

The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.

The Indexer map task processes all keys (urls) from all input files (linkdb, crawldb and segment). This includes every record in the linkdb and crawldb, even though we only need the records whose keys (urls) also occur in our segment; the rest is simply discarded in the Indexer reduce task. Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.

Any ideas on how to tackle this issue?
Is it feasible to look up the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?
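To sketch what I mean: since the crawldb is stored as partitioned MapFiles, one could in principle do a random-access lookup per url instead of streaming the whole db through the map phase, along the lines of what CrawlDbReader does. This is only a rough sketch, not tested in an indexing context; the path constant and types are what I believe Nutch uses, but please correct me if I'm wrong:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;

public class CrawlDbLookup {

  private final MapFile.Reader[] readers;
  private final HashPartitioner<Text, CrawlDatum> partitioner =
      new HashPartitioner<Text, CrawlDatum>();

  public CrawlDbLookup(Path crawlDb, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // Open one MapFile.Reader per crawldb partition; these stay open
    // so each subsequent lookup is a single seek, not a full scan.
    readers = MapFileOutputFormat.getReaders(
        fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), conf);
  }

  /** Fetch the CrawlDatum for a single url, or null if absent. */
  public CrawlDatum get(String url) throws IOException {
    Text key = new Text(url);
    CrawlDatum value = new CrawlDatum();
    // getEntry picks the right partition via the partitioner, then
    // does a sorted-key lookup inside that MapFile.
    return (CrawlDatum) MapFileOutputFormat.getEntry(
        readers, partitioner, key, value);
  }

  public void close() throws IOException {
    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}
```

The same pattern would apply to the linkdb (Inlinks values instead of CrawlDatum). The open question is whether doing one seek per segment url during indexing beats the current full scan once the dbs get large; for a small segment against a big db I'd expect it to, but I haven't measured it.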

Thanks!
Mathijs Homminga

--
Knowlogy
Helperpark 290 C
9723 ZA Groningen

[EMAIL PROTECTED]
+31 (0)6 15312977
http://www.knowlogy.nl

