Wuqi's point 2 sounds like what we talked about recently: a HostDB to track host-level information. Fetch frequency could be one of the pieces of data stored there for the Generator to use. There is no code in JIRA yet.
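
To make the idea concrete, here is a minimal sketch of a per-host fetch interval lookup that decides which URLs are due for re-fetching. It is plain Python, not Nutch code: the host names, the `HOST_FETCH_INTERVAL` table, the fallback interval, and the `is_due` / `select_due` helpers are all made up for illustration. In a real setup the table would be backed by something like a HostDB or the Lucene host index described in the quoted message.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Hypothetical host-level table: host name -> re-fetch interval.
# A real HostDB would also hold things like IP address or fetch speed.
HOST_FETCH_INTERVAL = {
    "a.example.com": timedelta(days=7),  # "domain A": re-crawl weekly
    "b.example.com": timedelta(days=2),  # "domain B": every other day
}

# Assumed fallback for hosts that have no entry in the table.
DEFAULT_INTERVAL = timedelta(days=30)


def is_due(url, last_fetched, now):
    """Return True if the URL's host interval has elapsed since last fetch."""
    host = urlparse(url).hostname
    interval = HOST_FETCH_INTERVAL.get(host, DEFAULT_INTERVAL)
    return now - last_fetched >= interval


def select_due(last_fetch_times, now):
    """Pick the URLs to re-fetch from a {url: last_fetched} mapping."""
    return sorted(
        url for url, fetched in last_fetch_times.items()
        if is_due(url, fetched, now)
    )
```

A generate step would call something like `select_due(...)` with the stored fetch times and only put the returned URLs into the next fetch list, so each host is re-crawled on its own schedule.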

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: wuqi <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, May 7, 2008 2:28:51 AM
> Subject: Re: periodically re-crawl several domains with different frequencies
>
> 1. You can re-fetch part of the pages into a new segment, build a new index
> for that segment, and then merge the new index with the old one. Don't forget
> to dedup the two indexes before merging.
>
> 2. If you are not dealing with too many hosts, say several thousand or tens
> of thousands of hosts, maybe what I have done can help you. I am using a
> Lucene index to store the host information. The host index includes
> information like IP address, fetch speed, PageRank of the index page, and
> also the fetch interval. You can then use the "Fetch Interval" information
> stored in the Lucene index during the Generator job. The fetch interval can
> be changed automatically during the fetch process or set manually.
>
> ----- Original Message -----
> From: "Marcel T"
> To:
> Sent: Wednesday, May 07, 2008 1:57 PM
> Subject: periodically re-crawl several domains with different frequencies
>
>
> Hi,
> I want to crawl and build an index for several domains using Nutch's
> intranet crawling method. Since those domains are updated from time to time,
> I want to re-crawl them periodically, but with different frequencies. Say,
> domain A is re-crawled every week, while domain B is re-crawled every other
> day. Two questions:
>
> 1) When I crawl into the same directory, the old index is completely
> removed. Is there any way I can update just the re-crawled URLs in the
> existing index?
>
> 2) How do I set a different crawling frequency for different domains? Should
> I crawl them individually and merge them, or can I configure this in Nutch?
>
>
> Many thanks!
