#2 should be a pluggable/hookable parameter. "high-scoring" sounds
like a reasonable default basis for choosing recrawl intervals, but
I'm sure that nearly everyone will think of a way to improve upon
that for their particular system.
e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)
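To make the suggestion concrete, here's a minimal sketch of what such a hook could look like: the interval choice becomes an interface, with the score-based rule as just the shipped default. All names here (the interface, `ThresholdPolicy`) are hypothetical, not actual Nutch APIs.

```java
// Hypothetical hook: the basis for choosing recrawl intervals is an
// interface, so "high-scoring" is only the default and any system can
// drop in its own rule.
interface RecrawlIntervalPolicy {
  /** Initial fetch interval, in days, for a url with the given score. */
  float initialIntervalDays(float score);
}

// One possible replacement policy: weekly recrawls above a
// site-specific score cutoff, monthly below it.
class ThresholdPolicy implements RecrawlIntervalPolicy {
  private final float cutoff;

  ThresholdPolicy(float cutoff) {
    this.cutoff = cutoff;
  }

  public float initialIntervalDays(float score) {
    return score >= cutoff ? 7f : 30f;  // weekly vs. monthly
  }
}
```

A system could then wire in, say, `new ThresholdPolicy(0.5f)` without touching the crawl code itself.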
--matt
On Dec 1, 2005, at 2:15 PM, Doug Cutting wrote:
It would be good to improve Nutch's support for incremental crawling.
Here are some ideas about how we might implement it. Andrzej has
posted about this in the past, so he probably has better ideas.
Incremental crawling could proceed as follows:
1. Bootstrap with a batch crawl, using the 'crawl' command. Modify
CrawlDatum to store the MD5Hash of the content of fetched urls.
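The change-detection part of step 1 might look like the following sketch, using the JDK's plain `MessageDigest` in place of Nutch's `MD5Hash` class; the class and method names are illustrative only.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch of step 1's content signature: hash fetched content so a later
// fetch can be compared against the digest stored in CrawlDatum.
class ContentSignature {
  static byte[] md5(byte[] content) {
    try {
      return MessageDigest.getInstance("MD5").digest(content);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);  // MD5 is always available in the JDK
    }
  }

  /** True if newly fetched content hashes to the stored signature. */
  static boolean unchanged(byte[] storedSig, byte[] newContent) {
    return Arrays.equals(storedSig, md5(newContent));
  }
}
```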
2. Reduce the fetch interval for high-scoring urls. If the default
is monthly, then the top-scoring 1% of urls might be set to daily,
and the top-scoring 10% of urls might be set to weekly.
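Step 2 could be sketched roughly as below. The 1%/10% thresholds and the daily/weekly/monthly values come straight from the mail; the class name and the sorted-array representation are made up for illustration.

```java
// Sketch of step 2: given urls sorted by score (best first), shorten the
// fetch interval for the top-scoring 1% and 10%.
class IntervalAssigner {
  static final float DAILY = 1f, WEEKLY = 7f, MONTHLY = 30f;

  /** scoresDescending must be sorted best-first; returns days per url. */
  static float[] assign(float[] scoresDescending) {
    int n = scoresDescending.length;
    float[] intervals = new float[n];
    for (int i = 0; i < n; i++) {
      if (i < n * 0.01) intervals[i] = DAILY;        // top-scoring 1%
      else if (i < n * 0.10) intervals[i] = WEEKLY;  // top-scoring 10%
      else intervals[i] = MONTHLY;                   // the default
    }
    return intervals;
  }
}
```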
3. Generate a fetch list & fetch it. When a url has been previously
fetched and its content is unchanged, increase its fetch interval by
some amount, e.g., 50%. If the content has changed, decrease the
fetch interval. The percentages of increase and decrease might be
influenced by the url's score.
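The adaptive rule in step 3 could be sketched like this. The 50% growth factor is from the mail; the shrink factor, the clamping bounds, and the names are placeholders (the score weighting the mail hints at is noted but not implemented here).

```java
// Sketch of step 3: stretch the interval when content is unchanged,
// shrink it when content has changed, clamped to sane bounds.
class AdaptiveInterval {
  static final float MIN_DAYS = 1f, MAX_DAYS = 90f;  // arbitrary bounds

  // The factors could additionally be weighted by the url's score, as
  // the mail suggests; a fixed pair is used here for simplicity.
  static float adjust(float currentDays, boolean contentChanged) {
    float factor = contentChanged ? 0.5f : 1.5f;  // 1.5 = the mail's +50%
    float next = currentDays * factor;
    return Math.max(MIN_DAYS, Math.min(MAX_DAYS, next));
  }
}
```

Over repeated fetches this converges each url toward its own change rate: stable pages drift out toward the maximum interval, frequently changing pages toward the minimum.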
4. Update the crawl db & link db, index the new segment, dedup,
etc. When updating the crawl db, scores for existing urls should
not change, since the scoring method we're using (OPIC) assumes
each page is fetched only once.
Steps 3 & 4 can be packaged as an 'update' command. Step 2 can be
included in the 'crawl' command, so that crawled indexes are always
ready for update.
Comments?
Doug
--
Matt Kangas / [EMAIL PROTECTED]