1. You can re-fetch just those pages into a new segment, build a new index for that segment, and then merge the new index with the old one. Don't forget to deduplicate the two indexes before merging.
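For concreteness, here is a minimal sketch of the merge step, assuming the re-fetched pages have already been indexed into a separate directory and both indexes have been deduplicated first (e.g. with Nutch's dedup tool). The paths are made up for illustration, and the sketch uses a recent Lucene API, so the method names differ slightly from the Lucene version bundled with Nutch at the time. Note that addIndexes itself does not drop duplicate URLs, which is why the dedup step has to come first.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeNewSegmentIndex {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: the existing index, the index built from the
        // re-fetched segment, and the merge target.
        Directory oldIndex = FSDirectory.open(Paths.get("crawl/index"));
        Directory newIndex = FSDirectory.open(Paths.get("crawl/index-refetch"));
        Directory merged   = FSDirectory.open(Paths.get("crawl/index-merged"));

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(merged, cfg)) {
            // Both source indexes are assumed to be deduplicated already;
            // addIndexes only copies their contents into the target index.
            writer.addIndexes(oldIndex, newIndex);
            writer.forceMerge(1); // optional: collapse the result to a single segment
        }
        oldIndex.close();
        newIndex.close();
        merged.close();
    }
}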
2. If you are not dealing with very many hosts (say, several thousand or tens of thousands), what I have done may help you. I use a Lucene index to store per-host information: IP address, fetch speed, PageRank of the host's index page, fetch interval, and so on. The fetch interval stored in that index can then be consulted during the Generator job, and it can be adjusted automatically during the fetch process or changed manually. A sketch of such a host index appears after the quoted message below.

----- Original Message -----
From: "Marcel T" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, May 07, 2008 1:57 PM
Subject: periodically re-crawl several domains with different frequencies

Hi,

I want to crawl and build an index for several domains with Nutch's intranet crawling method. Since those domains are updated from time to time, I want to re-crawl them periodically, but with different frequencies. Say, for domain A I re-crawl every week, while for domain B re-crawling is done every other day. Two questions:

1) When I crawl into the same directory, the old index is completely removed. Is there any way to just update the crawled URLs in the existing index?

2) How can I set a different crawling frequency for each domain? Should I crawl the domains individually and merge the results, or can this be configured in Nutch?

Many thanks!
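As a rough illustration of the host index mentioned in point 2, here is a minimal sketch of writing one per-host document with Lucene. The field names, values, and paths are invented for the example, the exact set of fields is up to you, and it again uses a recent Lucene API rather than the version Nutch shipped with at the time. During the Generator job you would look a host's document up by name and skip hosts whose fetch interval has not elapsed since the last fetch.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class HostIndexWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical location for the per-host index; it is kept separate
        // from the page index that Nutch builds.
        try (FSDirectory dir = FSDirectory.open(Paths.get("crawl/host-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            Document host = new Document();
            // One document per host; the host name is the lookup key.
            host.add(new StringField("host", "www.example.com", Field.Store.YES));
            host.add(new StoredField("ip", "192.0.2.10"));
            host.add(new StoredField("fetchSpeed", 12.5f));           // observed pages/second
            host.add(new StoredField("indexPagePageRank", 0.0042f));  // PageRank of the site's index page
            host.add(new StoredField("fetchInterval", 7 * 24 * 3600)); // seconds; re-crawl weekly
            writer.addDocument(host);
        }
    }
}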
