It would be good to improve the support for incremental crawling in
Nutch. Here are some ideas about how we might implement it. Andrzej
has posted about this in the past, so he probably has better ideas.
Incremental crawling could proceed as follows:
1. Bootstrap with a batch crawl, using the 'crawl' command. Modify
CrawlDatum to store the MD5Hash of the content of fetched urls.
2. Reduce the fetch interval for high-scoring urls. If the default is
monthly, then the top-scoring 1% of urls might be set to daily, and the
top-scoring 10% of urls might be set to weekly.
3. Generate a fetch list & fetch it. If a url has been fetched
previously and its content is unchanged, increase its fetch interval by
some amount, e.g., 50%. If the content has changed, decrease the fetch
interval. The percentages of increase and decrease might be influenced
by the url's score.
4. Update the crawl db & link db, index the new segment, dedup, etc.
When updating the crawl db, scores for existing urls should not change,
since the scoring method we're using (OPIC) assumes each page is fetched
only once.
Steps 3 & 4 can be packaged as an 'update' command. Step 2 can be
included in the 'crawl' command, so that crawled indexes are always
ready for update.
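As a sketch only, the interval logic in steps 2 & 3 might look something
like the following. The class and method names here are hypothetical,
not actual Nutch APIs; the daily/weekly thresholds and the 50% factor
are just the example numbers from above.

```java
import java.util.Arrays;

public class IntervalAdjuster {

    static final float DEFAULT_INTERVAL_DAYS = 30.0f; // monthly default

    // Step 2: assign a shorter initial interval to high-scoring urls.
    // scorePercentile: 0.0 = highest-scoring url, 1.0 = lowest.
    static float initialIntervalDays(float scorePercentile) {
        if (scorePercentile < 0.01f) return 1.0f;  // top 1%: daily
        if (scorePercentile < 0.10f) return 7.0f;  // top 10%: weekly
        return DEFAULT_INTERVAL_DAYS;              // everyone else: monthly
    }

    // Step 3: after a re-fetch, compare the stored content signature
    // (e.g. the MD5Hash kept in CrawlDatum) with the new one, and
    // stretch or shrink the interval by 50% accordingly.
    static float adjustedIntervalDays(float current,
                                      byte[] oldSig, byte[] newSig) {
        boolean changed = !Arrays.equals(oldSig, newSig);
        return changed ? current * 0.5f : current * 1.5f;
    }
}
```

So an unchanged monthly page would next be fetched in 45 days, a changed
one in 15; scores could later be folded in by varying the 0.5/1.5
factors per url.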
Comments?
Doug
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers