Re: Generating multiple fetchlists between updates

Doug Cutting Thu, 19 Jan 2006 13:30:40 -0800

Andrzej Bialecki wrote:

In the 0.7 branch, whenever a segment was generated the WebDB was modified, so that the entries that ended up in the fetchlist wouldn't be immediately available to the next segment generation, if that happened before the WebDB was updated with the data from that first segment. This was achieved by adding 1 week to the next fetchTime on a Page.
I can't see that we do it in the trunk. This means that we cannot generate more than one fetchlist between the CrawlDB updates, because each fetchlist would be identical to the previous one... Should we worry about this? There is a cost to modify the CrawlDB, but there is also a cost to not be able to generate multiple different fetchlists and fetch them in parallel...
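
The mechanism described above could look roughly like the sketch below. This is only an illustration, not Nutch code: the class FetchlistSketch, its in-memory map, the deferGenerated switch and the generate(now, topN) method are all hypothetical names made up for this example. The only detail taken from the message is that a selected entry has one week added to its next fetch time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the page database; not an actual Nutch class.
class FetchlistSketch {
    // The one-week delay mentioned for the 0.7 branch, in milliseconds.
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    // URL -> next fetch time in milliseconds (insertion order is kept).
    private final Map<String, Long> nextFetchTime = new LinkedHashMap<>();

    // Assumption: the behaviour is switchable, since the modification has a cost.
    private final boolean deferGenerated;

    FetchlistSketch(boolean deferGenerated) {
        this.deferGenerated = deferGenerated;
    }

    void add(String url, long fetchTime) {
        nextFetchTime.put(url, fetchTime);
    }

    // Select up to topN entries that are due at time 'now'.
    // With deferGenerated set, each selected entry has one week added to
    // its next fetch time, so a second generate() call made before the
    // database is updated does not select it again (as long as the entry
    // was less than a week overdue).
    List<String> generate(long now, int topN) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> entry : nextFetchTime.entrySet()) {
            if (fetchlist.size() >= topN) {
                break;
            }
            if (entry.getValue() <= now) {
                fetchlist.add(entry.getKey());
                if (deferGenerated) {
                    entry.setValue(entry.getValue() + ONE_WEEK_MS);
                }
            }
        }
        return fetchlist;
    }
}
```

With deferGenerated enabled, two consecutive generate calls return disjoint lists that could be fetched in parallel; with it disabled, the second list repeats the first, which is the situation described for the trunk.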

I think this would be a useful feature to resurrect. I'd vote for making it optional, at least at first.

Ideally one could run crawldb update and generate jobs in parallel with the fetch job, so that, as soon as a fetch completes, the next can start.


Doug
