Thanks, wuqi and Otis, for the information.
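
To make sure I follow wuqi's second suggestion, here is roughly the kind of host-level Lucene index I picture: one document per host, with a stored re-fetch interval that a Generator-side lookup could read. This is only a sketch against the plain Lucene 2.x API, not the HostDB code Otis mentions (there is none in JIRA yet); the class name, field names and index location are placeholders of my own, and wiring the lookup into Nutch's Generate step is left out.

// Sketch of a host-level Lucene index (Lucene 2.x API); not Nutch's HostDB.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class HostIndexSketch {

    // One document per host; the fetch interval is stored, not searched on.
    private static Document hostDoc(String host, int intervalDays) {
        Document doc = new Document();
        doc.add(new Field("host", host, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("fetchIntervalDays", Integer.toString(intervalDays),
                          Field.Store.YES, Field.Index.NO));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String indexDir = "hostdb-index";   // placeholder location

        // Build the host index: weekly for domain A, every other day for domain B.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        writer.addDocument(hostDoc("domainA.com", 7));
        writer.addDocument(hostDoc("domainB.com", 2));
        writer.close();

        // Generator-side lookup: how often should domainB.com be re-fetched?
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(new TermQuery(new Term("host", "domainB.com")));
        if (hits.length() > 0) {
            int days = Integer.parseInt(hits.doc(0).get("fetchIntervalDays"));
            System.out.println("re-fetch domainB.com every " + days + " days");
        }
        searcher.close();
    }
}

As wuqi describes, the interval values could then be rewritten during the fetch process or by hand, simply by updating the host's document.
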
----------------------------------------
> Date: Wed, 7 May 2008 07:16:05 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: periodically re-crawl several domains with different frequencies
> To: [email protected]
>
> Wuqi's 2. sounds like what we talked about recently - HostDB to track
> host-level information. Fetch frequency could be one of the pieces of data
> to store for Generator to use. There is no code in JIRA yet.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: wuqi
>> To: [email protected]
>> Sent: Wednesday, May 7, 2008 2:28:51 AM
>> Subject: Re: periodically re-crawl several domains with different frequencies
>>
>> 1. You can partially re-fetch the pages into a new segment, build a new
>> index for that segment, and then merge the new index with the old one; also
>> don't forget to dedup the two indexes before merging.
>>
>> 2. If you are not dealing with many hosts, say several thousand or tens of
>> thousands, maybe what I have done can help you. I am using a Lucene index to
>> store the host information: it holds things like IP address, fetch speed,
>> PageRank of the index page, and also the fetch interval. You can then use the
>> "fetch interval" stored in the Lucene index during the Generator job, and the
>> fetch interval can be changed automatically during the fetch process, or just
>> manually.
>>
>> ----- Original Message -----
>> From: "Marcel T"
>> To:
>> Sent: Wednesday, May 07, 2008 1:57 PM
>> Subject: periodically re-crawl several domains with different frequencies
>>
>>
>> Hi,
>> I want to crawl and build an index for several domains with Nutch's intranet
>> crawling method. Since those domains update from time to time, I want to
>> re-crawl them periodically, but with different frequencies. Say, for domain A
>> I re-crawl every week, while for domain B re-crawling is done every other day.
>> Two questions here:
>> 1) When I crawl into the same directory, the old index is completely removed.
>> Is there any way I can just update the crawled URLs in the existing index?
>>
>> 2) How do I set different crawling frequencies for different domains? Should I
>> crawl them individually and merge them, or can I configure it in Nutch?
>>
>>
>> Many thanks!
