Wuqi's point 2 sounds like what we talked about recently: a HostDB to track 
host-level information.  Fetch frequency could be one of the pieces of data it 
stores for the Generator to use.  There is no code in JIRA yet.
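
To make the idea concrete, here is a rough sketch of how a per-host fetch interval might be consulted at generate time. This is not Nutch code: HostRecord, host_db, is_due and the 30-day fallback are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class HostRecord:
    """Hypothetical host-level record; not an actual Nutch class."""
    fetch_interval: int  # seconds to wait between re-fetches for this host


# Assumed fallback for hosts that have no record of their own.
DEFAULT_INTERVAL = 30 * DAY


def is_due(url, last_fetch, now, host_db):
    """Return True if the URL's host-level interval has elapsed.

    last_fetch and now are timestamps in seconds; host_db maps a
    hostname to its HostRecord.
    """
    host = urlparse(url).hostname
    record = host_db.get(host)
    interval = record.fetch_interval if record else DEFAULT_INTERVAL
    return now - last_fetch >= interval


# Example: domain A is re-crawled weekly, domain B every other day.
host_db = {
    "a.example.com": HostRecord(fetch_interval=7 * DAY),
    "b.example.com": HostRecord(fetch_interval=2 * DAY),
}
```

A generator-like step would call is_due() for each candidate URL and select only those that return True, so hosts with shorter intervals get picked up more often.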

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: wuqi <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, May 7, 2008 2:28:51 AM
> Subject: Re: periodically re-crawl several domains with different frequencies
> 
> 1. You can re-fetch just some of the pages into a new segment, build a new 
> index for that segment, and then merge the new index with the old one. Don't 
> forget to dedup the two indexes before merging.
> 
> 2. If you are not dealing with many hosts (say, several thousand or tens of 
> thousands), maybe what I have done can help you. I am using a Lucene index to 
> store host information. The host index includes information such as IP 
> address, fetch speed, PageRank of the index page, and Fetch Interval. You can 
> then use the "Fetch Interval" stored in the Lucene index during the Generator 
> job. The fetch interval can be changed automatically during the fetch process 
> or set manually.
> 
> ----- Original Message ----- 
> From: "Marcel T" 
> To: 
> Sent: Wednesday, May 07, 2008 1:57 PM
> Subject: periodically re-crawl several domains with different frequencies
> 
> 
> 
> Hi,
> I want to crawl and build an index for several domains using Nutch's intranet 
> crawling method. Since those domains are updated from time to time, I want to 
> re-crawl them periodically, but at different frequencies. Say, domain A is 
> re-crawled every week, while domain B is re-crawled every other day. 
> Two questions:
> 1) When I crawl again into the same directory, the old index is completely 
> removed. Is there any way to update just the re-crawled URLs in the existing 
> index?
> 
> 2) How do I set different crawl frequencies for different domains? Should I 
> crawl them individually and merge the results, or can I configure this in 
> Nutch?
> 
> 
> Many thanks!
