Doug Cutting wrote:

It would be good to improve the support for incremental crawling in Nutch. Here are some ideas about how we might implement it. Andrzej has posted about this in the past, so he probably has better ideas.

Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command. Modify CrawlDatum to store the MD5Hash of the content of fetched urls.


Yes, this is required to detect unmodified content. A small note: a plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter or with ads, where trivial changes make every fetch look modified. It would be good to provide a framework for other implementations of "page equality" - for now perhaps we should just say that this value is a byte[], and not specifically an MD5Hash.
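
A minimal sketch of what such a framework could look like, using only the standard JDK; the names below (PageSignature, MD5PageSignature, NormalizedTextSignature) are purely illustrative assumptions, not existing Nutch classes:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Pluggable "page equality": an opaque byte[] fingerprint of fetched content. */
public interface PageSignature {
  byte[] calculate(byte[] content);
}

/** The straightforward implementation: a plain MD5 over the raw bytes. */
class MD5PageSignature implements PageSignature {
  public byte[] calculate(byte[] content) {
    try {
      return MessageDigest.getInstance("MD5").digest(content);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is always available in the JDK
    }
  }
}

/** Crude alternative: hash a normalized text profile, so that volatile
 *  fragments (counters, ad markup) are less likely to change the signature. */
class NormalizedTextSignature implements PageSignature {
  public byte[] calculate(byte[] content) {
    String text = new String(content, StandardCharsets.UTF_8)
        .replaceAll("\\d+", "")   // drop numbers such as hit counters
        .replaceAll("\\s+", " ")  // collapse whitespace differences
        .toLowerCase();
    try {
      return MessageDigest.getInstance("MD5").digest(text.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }
}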

Other additions to CrawlDatum for consideration:

* last modified time, not just the last fetched time - these two are different, and the fetching policy will depend on both. E.g. to synchronize with the page change cycle it is necessary to know the time of the previous modification seen by Nutch. I've done simulations, which show that if we don't track this value then the fetchInterval adjustments won't stabilize even if the page change cycle is fixed.

* segment name from the last updatedb. I'm not fully convinced about this, but consider the following:

I think this is needed in order to check which segments may be safely deleted, because there are no more active pages in them. If you enable a variable fetchInterval, then after a while you will end up with widely ranging intervals - some pages will have a daily or hourly period, others a period of several months. Add to this the fact that each page's clock starts at a different moment, and the oldest page you still reference could be as old as maxFetchInterval (whatever that is - Float.MAX_VALUE or some other maximum you set). Most likely such old pages would live in segments with very little current data.

Now, you need to minimize the number of active segments (because of search performance and the time to deduplicate). However, with a variable fetchInterval you no longer know which segments are safe to delete. I imagine a tool could collect all segment names from the CrawlDB and prepare a list of (segmentName, numRecords) pairs, roughly as sketched below. Segments not found on this list would be safe to delete. Segments with only a few records could be processed to extract those records and merge them into a single segment (discarding the rest of the old segment data).
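
A rough sketch of such a tool, assuming a hypothetical CrawlDbEntry record that carries the segment name written at the last updatedb; a real implementation would presumably run as a job over the CrawlDB rather than an in-memory scan:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SegmentUsageReport {

  /** Hypothetical CrawlDB record: a CrawlDatum plus the segment it was last written to. */
  public interface CrawlDbEntry {
    String getSegmentName();
  }

  /** Build the (segmentName, numRecords) list from all CrawlDB records. */
  public static Map<String, Long> countActiveRecords(Iterable<CrawlDbEntry> crawlDb) {
    Map<String, Long> counts = new HashMap<>();
    for (CrawlDbEntry entry : crawlDb) {
      counts.merge(entry.getSegmentName(), 1L, Long::sum);
    }
    return counts;
  }

  /** Segments on disk that no CrawlDB record points to are safe to delete. */
  public static Set<String> deletableSegments(Set<String> segmentsOnDisk,
                                              Map<String, Long> activeCounts) {
    Set<String> deletable = new HashSet<>(segmentsOnDisk);
    deletable.removeAll(activeCounts.keySet());
    return deletable;
  }
}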

..

Alternatively, we could add Properties to CrawlDatum, and let people put whatever they wish there...
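
If we went that route, the general idea would be an open-ended metadata map per CrawlDB entry, as in the sketch below; the class and method names are illustrative, not the actual CrawlDatum API:

import java.util.HashMap;
import java.util.Map;

/** Free-form per-URL metadata, e.g. a fetch schedule could stash the last
 *  observed modification time here without changing the CrawlDatum schema. */
public class CrawlDatumMetadata {
  private final Map<String, String> metaData = new HashMap<>();

  public void setMeta(String key, String value) { metaData.put(key, value); }
  public String getMeta(String key) { return metaData.get(key); }
}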


2. Reduce the fetch interval for high-scoring urls. If the default is monthly, then the top-scoring 1% of urls might be set to daily, and the top-scoring 10% of urls might be set to weekly.


In the original patchset I had a notion of pluggable FetchSchedule-s. I think this would be the ideal place to make such decisions. Implementations would be pluggable in the same way as URLFilter, with the DefaultFetchSchedule doing what we do today.
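
A minimal sketch of what that contract could look like; the interface and method names are assumptions for illustration, not an existing Nutch API:

/** Pluggable fetch scheduling policy, wired up in the same spirit as URLFilter. */
public interface FetchSchedule {

  /** Initial fetch interval (in ms) for a newly added URL; an implementation
   *  could hand out daily/weekly intervals to the top-scoring pages. */
  long initialInterval(String url, float score);

  /** Recompute the interval after a fetch, given whether the content changed. */
  long nextInterval(String url, float score, long currentInterval, boolean modified);
}

/** Default behaviour: a fixed interval, regardless of score or change history. */
class DefaultFetchSchedule implements FetchSchedule {
  private static final long MONTHLY_MS = 30L * 24 * 60 * 60 * 1000;

  public long initialInterval(String url, float score) { return MONTHLY_MS; }

  public long nextInterval(String url, float score, long currentInterval, boolean modified) {
    return MONTHLY_MS;
  }
}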

3. Generate a fetch list & fetch it. When the url has been previously fetched, and its content is unchanged, increase its fetch interval by an amount, e.g., 50%. If the content is changed, decrease the fetch interval. The percentage of increase and decrease might be influenced by the url's score.


Again, that's the task for a FetchSchedule.
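
To make step 3 concrete, here is a rough adaptive policy expressed against the hypothetical FetchSchedule interface sketched above; the factors and bounds are illustrative defaults, not Nutch settings:

/** Grow the interval when content is unchanged, shrink it when it changed,
 *  clamped to sane bounds; the score could be used to tune both factors. */
class AdaptiveFetchSchedule implements FetchSchedule {
  private static final long MIN_INTERVAL_MS = 60L * 60 * 1000;            // 1 hour
  private static final long MAX_INTERVAL_MS = 365L * 24 * 60 * 60 * 1000; // 1 year

  public long initialInterval(String url, float score) {
    return 30L * 24 * 60 * 60 * 1000; // start from the monthly default
  }

  public long nextInterval(String url, float score, long currentInterval, boolean modified) {
    // Changed content: re-check twice as often. Unchanged: back off by 50%.
    double factor = modified ? 0.5 : 1.5;
    long next = (long) (currentInterval * factor);
    return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
  }
}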

4. Update the crawl db & link db, index the new segment, dedup, etc. When updating the crawl db, scores for existing urls should not change, since the scoring method we're using (OPIC) assumes each page is fetched only once.


I would love to refactor this part too, to abstract the scoring mechanism in a similar way, so that you could plug in different scoring implementations. The float value in CrawlDatum is opaque enough to support different scoring mechanisms.
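
A minimal sketch of what such an abstraction might look like; the interface name and methods are hypothetical, and the only state assumed is the opaque float already stored in CrawlDatum:

/** Pluggable scoring, analogous to URLFilter and the FetchSchedule above. */
public interface ScoringPlugin {

  /** Score given to a URL when it is first injected into the CrawlDB. */
  float initialScore(String url);

  /** Recompute a page's score during updatedb from the contributions of its
   *  inlinks; an OPIC-style implementation would distribute "cash" to outlinks
   *  and leave the existing score of already-fetched pages untouched. */
  float updateScore(String url, float oldScore, float[] inlinkContributions);
}

/** Trivial example implementation: simply accumulate inlink contributions. */
class AccumulatingScoring implements ScoringPlugin {
  public float initialScore(String url) { return 1.0f; }

  public float updateScore(String url, float oldScore, float[] inlinkContributions) {
    float score = oldScore;
    for (float c : inlinkContributions) score += c;
    return score;
  }
}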

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

