1. You can re-fetch just those pages into a new segment, build a new index 
for that segment, and then merge the new index with the old one. Don't forget 
to dedup the two indexes before merging (a rough sketch of the dedup + merge 
step is below).
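
Just to illustrate the idea, here is a minimal sketch of the dedup + merge 
step using the plain Lucene API (the Lucene 2.x calls that Nutch ships with). 
The index paths and the "url" field name are assumptions for illustration; in 
practice you would normally let Nutch's own dedup and index-merge tools do 
this for you.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeWithDedup {
  public static void main(String[] args) throws Exception {
    Directory oldIndex = FSDirectory.getDirectory("crawl/index");      // existing index (assumed path)
    Directory newIndex = FSDirectory.getDirectory("crawl/index-new");  // index built from the re-fetched segment

    // 1) Dedup: for every URL present in the new index, delete the stale
    //    copy from the old index so only the fresh document survives.
    IndexReader newReader = IndexReader.open(newIndex);
    IndexReader oldReader = IndexReader.open(oldIndex);  // opened writable so deletes are allowed
    for (int i = 0; i < newReader.maxDoc(); i++) {
      if (newReader.isDeleted(i)) continue;
      String url = newReader.document(i).get("url");     // "url" is the stored field in Nutch indexes
      if (url != null) oldReader.deleteDocuments(new Term("url", url));
    }
    oldReader.close();   // commits the deletions
    newReader.close();

    // 2) Merge: append the new index into the old one and optimize.
    IndexWriter writer = new IndexWriter(oldIndex, new StandardAnalyzer(), false);
    writer.addIndexes(new Directory[] { newIndex });
    writer.optimize();
    writer.close();
  }
}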

2. If you are not dealing with too many hosts, say several thousand or tens 
of thousands, maybe what I have done can help you. I use a separate Lucene 
index to store per-host information: IP address, fetch speed, PageRank of the 
host's index page, fetch interval, and so on. You can then read the "Fetch 
Interval" value from that Lucene index during the Generator job, and the 
fetch interval can be adjusted automatically during the fetch process or 
changed manually (a small sketch of such a host index follows).
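
For reference, this is only a sketch of what such a host index could look 
like, not the exact code I run. The index path, the field names (host, ip, 
fetchSpeed, pageRank, fetchInterval) and the lookup helper are all made up 
for this example; hooking the lookup into the Generator job is left out.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class HostInfoIndex {
  private static final String PATH = "crawl/hostdb";   // assumed location of the host index

  // Add or refresh one host record (one document per host).
  public static void addHost(String host, String ip, float fetchSpeed,
                             float pageRank, int fetchIntervalDays) throws Exception {
    IndexWriter writer = new IndexWriter(PATH, new StandardAnalyzer(), false); // false = append to an existing index
    Document doc = new Document();
    doc.add(new Field("host", host, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ip", ip, Field.Store.YES, Field.Index.NO));
    doc.add(new Field("fetchSpeed", Float.toString(fetchSpeed), Field.Store.YES, Field.Index.NO));
    doc.add(new Field("pageRank", Float.toString(pageRank), Field.Store.YES, Field.Index.NO));
    doc.add(new Field("fetchInterval", Integer.toString(fetchIntervalDays), Field.Store.YES, Field.Index.NO));
    writer.updateDocument(new Term("host", host), doc);  // replaces any older record for this host
    writer.close();
  }

  // Look up the fetch interval for a host, e.g. when deciding what to generate.
  public static int getFetchInterval(String host, int defaultDays) throws Exception {
    IndexSearcher searcher = new IndexSearcher(PATH);
    try {
      Hits hits = searcher.search(new TermQuery(new Term("host", host)));
      if (hits.length() == 0) return defaultDays;
      return Integer.parseInt(hits.doc(0).get("fetchInterval"));
    } finally {
      searcher.close();
    }
  }
}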

----- Original Message ----- 
From: "Marcel T" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, May 07, 2008 1:57 PM
Subject: periodically re-crawl several domains with different frequencies



Hi,
I want to crawl and build an index for several domains using Nutch's intranet 
crawling method. Since those domains are updated from time to time, I want to 
re-crawl them periodically, but with different frequencies. Say, for domain A 
I re-crawl every week, but for domain B re-crawling is done every other day, 
for example. Two questions here:
1) When I crawl into the same directory, the old index is completely removed. 
Is there any way I can just update the re-crawled URLs in the existing index?

2) How do I set a different crawl frequency for each domain? Should I crawl 
them individually and merge the results, or can I configure this in Nutch?


Many thanks!
