Thanks, wuqi and Otis, for the information.

----------------------------------------
> Date: Wed, 7 May 2008 07:16:05 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: periodically re-crawl several domains with different frequencies
> To: [email protected]
> 
> Wuqi's #2 sounds like what we talked about recently - a HostDB to track 
> host-level information.  Fetch frequency could be one of the pieces of data 
> to store for the Generator to use.  There is no code in JIRA yet.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
>> From: wuqi 
>> To: [email protected]
>> Sent: Wednesday, May 7, 2008 2:28:51 AM
>> Subject: Re: periodically re-crawl several domains with different frequencies
>> 
>> 1. You can partially re-fetch the pages into a new segment, build a new index 
>> for that segment, and then merge the new index with the old one; don't forget 
>> to dedup the two indexes before merging.
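
A rough sketch of what that dedup-before-merge step amounts to at the Lucene
level (the "url" field name, the paths, and the use of plain Lucene API calls
here are assumptions for illustration; in practice Nutch ships its own dedup
and index-merge tools for this):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import java.nio.file.Paths;

public class MergeFreshIndex {
  public static void main(String[] args) throws Exception {
    // Existing merged index, and the index just built from the re-fetched segment.
    Directory oldIndex = FSDirectory.open(Paths.get("crawl/index"));
    Directory newIndex = FSDirectory.open(Paths.get("crawl/indexes/part-new"));

    try (IndexWriter writer = new IndexWriter(oldIndex,
             new IndexWriterConfig(new StandardAnalyzer()));
         IndexReader fresh = DirectoryReader.open(newIndex)) {

      // "Dedup": for every URL present in the freshly built index, delete the
      // stale copy of that page from the old index before merging.
      for (LeafReaderContext ctx : fresh.leaves()) {
        Terms urls = ctx.reader().terms("url");
        if (urls == null) continue;
        TermsEnum term = urls.iterator();
        while (term.next() != null) {
          writer.deleteDocuments(new Term("url", term.term().utf8ToString()));
        }
      }

      // Merge: pull the new segment's index into the old one.
      writer.addIndexes(newIndex);
      writer.commit();
    }
  }
}
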
>> 
>> 2. If you are not dealing with too many hosts, say several thousand or tens of 
>> thousands, maybe what I have done can help you. I am using a Lucene index to 
>> store host information: the host index includes the IP address, fetch speed, 
>> PageRank of the index page, fetch interval, etc. You can then use the fetch 
>> interval stored in the Lucene index during the Generator job, and the fetch 
>> interval can be adjusted automatically during the fetch process or manually.
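
To picture the host index in 2., here is a minimal sketch of a host-level Lucene
index carrying a per-host fetch interval that a check during the Generator job
could consult. All field names, paths, and the schema are assumptions for
illustration, not the actual code described above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import java.nio.file.Paths;

public class HostInfoIndex {
  // Record one host's metadata; one document per host.
  static void addHost(IndexWriter writer, String host, String ip,
                      float pageRank, int fetchIntervalDays) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("host", host, Field.Store.YES));
    doc.add(new StringField("ip", ip, Field.Store.YES));
    doc.add(new StoredField("page_rank", pageRank));
    doc.add(new StoredField("fetch_interval_days", fetchIntervalDays));
    // Replace any earlier entry for this host, so the interval can be adjusted
    // later, either automatically after a fetch or by hand.
    writer.updateDocument(new Term("host", host), doc);
  }

  // Look up the per-host fetch interval; the generate step could use this to
  // skip hosts whose interval has not yet elapsed.
  static Integer fetchIntervalDays(IndexSearcher searcher, String host) throws Exception {
    TopDocs hits = searcher.search(new TermQuery(new Term("host", host)), 1);
    if (hits.scoreDocs.length == 0) return null;
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    return doc.getField("fetch_interval_days").numericValue().intValue();
  }

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("hostdb-index"));
    try (IndexWriter writer = new IndexWriter(dir,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      addHost(writer, "domainA.example", "10.0.0.1", 0.8f, 7);  // weekly re-crawl
      addHost(writer, "domainB.example", "10.0.0.2", 0.5f, 2);  // every other day
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      System.out.println(fetchIntervalDays(new IndexSearcher(reader), "domainB.example"));
    }
  }
}
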
>> 
>> ----- Original Message ----- 
>> From: "Marcel T" 
>> To: 
>> Sent: Wednesday, May 07, 2008 1:57 PM
>> Subject: periodically re-crawl several domains with different frequencies
>> 
>> 
>> 
>> Hi,
>> I want to crawl and build an index for several domains using Nutch's intranet 
>> crawling method. Since those domains are updated from time to time, I want to 
>> re-crawl them periodically, but with different frequencies. Say, for domain A 
>> I re-crawl every week, but for domain B re-crawling is done every other day. 
>> Two questions here:
>> 1) When I crawl into the same directory, the old index is completely removed. 
>> Is there any way I can just update the crawled URLs in the existing index?
>> 
>> 2) How do I set different crawling frequencies for different domains? Should I 
>> crawl them individually and merge them, or can I configure this in Nutch?
>> 
>> 
>> Many thanks!
> 
