Thanks Andrzej,
Perhaps these numbers make our issue clearer:
- after a week of (internet) crawling, the crawldb contains about 22M documents.
- 6M documents have been fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, ~0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, ~3.5 kB/doc)
- size of a segment = between 100 and 500 MB (25K docs, max ~20 kB/doc)
As you can see, even for a 500 MB segment, more than 99% of the I/O during
indexing is due to the linkdb and crawldb.
We could increase the size of our segments, but in the end this only
delays the problem.
We are now indexing without the linkdb. This reduces the indexing time by
a factor of 10, but we would really like to have the link texts back again
in the future.
Thanks,
Mathijs
Andrzej Bialecki wrote:
> Mathijs Homminga wrote:
>> Hi everyone,
>> Our crawler generates and fetches segments continuously. We'd like to
>> index and merge each new segment immediately (or with a small delay)
>> such that our index grows incrementally. This is unlike the normal
>> situation, where one would create a linkdb and an index of all
>> segments at once, after the crawl has finished.
>> The problem we have is that Nutch currently needs the complete linkdb
>> and crawldb each time we want to index a single segment.
> The reason for wanting the linkdb is the anchor information. If you
> don't need any anchor information, you can provide an empty linkdb.
> The reason why the crawldb is needed is to get the current page status
> information (which may have changed in the meantime due to subsequent
> crawldb updates from newer segments). If you don't need this
> information, you can modify the Indexer.reduce() method (~line 212) to
> allow for this, and then remove the line in Indexer.index() that adds
> the crawldb to the list of input paths.
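
(For reference, here is a rough sketch of those two changes, paraphrased
from memory of the 0.8/0.9-era Indexer; the variable names, constants and
the ~line 212 location may differ in your tree:)

    // In Indexer.index(): the job inputs are, roughly, the segment dirs
    // plus the crawldb and linkdb. Dropping the crawldb input avoids the
    // full crawldb scan, at the cost of stale page status in the index.
    for (int i = 0; i < segments.length; i++) {
      job.addInputPath(new Path(segments[i], CrawlDatum.FETCH_DIR_NAME));
      job.addInputPath(new Path(segments[i], ParseData.DIR_NAME));
      job.addInputPath(new Path(segments[i], ParseText.DIR_NAME));
    }
    // job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));  // removed
    job.addInputPath(new Path(linkDb, LinkDb.CURRENT_NAME));

    // In Indexer.reduce(): relax the guard so that a url without a
    // crawldb entry is still indexed. dbDatum may then be null, so any
    // later use of it needs a null check.
    if (fetchDatum == null || parseText == null || parseData == null) {
      return;                     // no segment data for this url
    }
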
>> The Indexer map task processes all keys (urls) from the input files
>> (linkdb, crawldb and segment). This includes all data from the linkdb
>> and crawldb that we don't actually need, since we are only interested
>> in the data that corresponds to the keys (urls) in our segment (the
>> rest is only filtered out in the Indexer reduce task).
>> Obviously, as the linkdb and crawldb grow, this becomes more and more
>> of a problem.
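
(To make that concrete, here is a sketch of the reduce-side join in
Indexer.reduce(), again paraphrased from the 0.8/0.9-era code, so the
exact value types handled may differ:)

    // Every record from the crawldb, the linkdb and the segment arrives
    // here grouped by url, including urls the segment does not contain.
    Inlinks inlinks = null;
    CrawlDatum dbDatum = null, fetchDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;
    while (values.hasNext()) {
      Object value = ((ObjectWritable) values.next()).get();    // unwrap
      if (value instanceof Inlinks) {
        inlinks = (Inlinks) value;                               // from the linkdb
      } else if (value instanceof CrawlDatum) {
        CrawlDatum d = (CrawlDatum) value;
        if (CrawlDatum.hasDbStatus(d)) dbDatum = d;              // from the crawldb
        else if (CrawlDatum.hasFetchStatus(d)) fetchDatum = d;   // from the segment
      } else if (value instanceof ParseData) {
        parseData = (ParseData) value;
      } else if (value instanceof ParseText) {
        parseText = (ParseText) value;
      }
    }
    // urls without segment data are only dropped after this point, once
    // all the linkdb/crawldb records have already been read and shuffled.

So the cost of scanning the two databases is paid whether or not a url
ends up in the index.
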
> Is this really a problem for you now? Unless your segments are tiny,
> the indexing process will be dominated by I/O from the processing of
> parseText / parseData and Lucene operations.
>> Any ideas on how to tackle this issue?
>> Is it feasible to look up the corresponding linkdb and crawldb data
>> for each key (url) in the segment before or during indexing?
> It would probably be too slow, unless you made a copy of the
> linkdb/crawldb on the local FS-es of each node. But at this point the
> benefit of this change would be doubtful, because of all the I/O you
> would need to do to prepare each task's environment ...
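
(For completeness: the per-url lookup idea is essentially what Nutch's
LinkDbReader.getInlinks() already does. A minimal sketch of that pattern,
with the linkdb path and the example url made up for illustration:)

    // Open the linkdb MapFiles once, then do one random get() per url.
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path linkDb = new Path("crawl/linkdb");                   // made-up path
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
        fs, new Path(linkDb, LinkDb.CURRENT_NAME), conf);
    Partitioner partitioner = new HashPartitioner();

    Text url = new Text("http://www.knowlogy.nl/");           // example key
    Inlinks inlinks = (Inlinks) MapFileOutputFormat.getEntry(
        readers, partitioner, url, new Inlinks());
    // Each get() is a seek into DFS; 25K of these against a ~75 GB linkdb
    // is exactly the kind of random I/O warned about above, unless the
    // MapFiles live on each node's local filesystem.
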
--
Knowlogy
Helperpark 290 C
9723 ZA Groningen
[EMAIL PROTECTED]
+31 (0)6 15312977
http://www.knowlogy.nl