Hi everyone,

Our crawler generates and fetches segments continuously. We'd like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished.

The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.

The Indexer map task processes all keys (urls) from all input files (linkdb, crawldb and segment). This includes every record in the linkdb and crawldb, even though we only need the records whose keys (urls) also occur in our segment; the rest is simply discarded in the Indexer reduce task. Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.

Any ideas on how to tackle this issue?
Is it feasible to look up the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?
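To sketch what I mean: since the crawldb is stored as partitioned MapFiles, one could in principle do a random-access lookup per url instead of streaming the whole db through the map phase, along the lines of what CrawlDbReader does. This is only a rough sketch, not tested in an indexing context; the path constant and types are what I believe Nutch uses, but please correct me if I'm wrong:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;

public class CrawlDbLookup {

  private final MapFile.Reader[] readers;
  private final HashPartitioner<Text, CrawlDatum> partitioner =
      new HashPartitioner<Text, CrawlDatum>();

  public CrawlDbLookup(Path crawlDb, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // Open one MapFile.Reader per crawldb partition; these stay open
    // so each subsequent lookup is a single seek, not a full scan.
    readers = MapFileOutputFormat.getReaders(
        fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), conf);
  }

  /** Fetch the CrawlDatum for a single url, or null if absent. */
  public CrawlDatum get(String url) throws IOException {
    Text key = new Text(url);
    CrawlDatum value = new CrawlDatum();
    // getEntry picks the right partition via the partitioner, then
    // does a sorted-key lookup inside that MapFile.
    return (CrawlDatum) MapFileOutputFormat.getEntry(
        readers, partitioner, key, value);
  }

  public void close() throws IOException {
    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}
```

The same pattern would apply to the linkdb (Inlinks values instead of CrawlDatum). The open question is whether doing one seek per segment url during indexing beats the current full scan once the dbs get large; for a small segment against a big db I'd expect it to, but I haven't measured it.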

Thanks!
Mathijs Homminga

--
Knowlogy
Helperpark 290 C
9723 ZA Groningen

[EMAIL PROTECTED]
+31 (0)6 15312977
http://www.knowlogy.nl

