Hi,

If you want an adaptive fetch schedule only for specific domains, you can do
this:

Write your own subclass of AdaptiveFetchSchedule, e.g. MyAdaptiveFetchSchedule:

   MyAdaptiveFetchSchedule extends AdaptiveFetchSchedule {

     void setConf(org.apache.hadoop.conf.Configuration conf)

     CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum,
         long prevFetchTime, long prevModifiedTime, long fetchTime,
         long modifiedTime, int state)
   }

Sorry, I just copied the two method signatures from the API doc :-)
(http://www.netlikon.de/docs/javadoc-nutch-trunk/org/apache/nutch/crawl/AdaptiveFetchSchedule.html)

You have two things to do:

(1) Write a new configuration file, e.g. refetch.domains.txt, list in it all
the domains you want to refetch frequently, and then add a new property to
nutch-site.xml, like below:

<property>
 <name>refetch.domains.file</name>
 <value>refetch.domains.txt</value>
</property>

Then add a new field to MyAdaptiveFetchSchedule, e.g. refetchDomains, and add
code to the setConf method that reads the refetch.domains.file property and
loads the contents of refetch.domains.txt into refetchDomains.
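The file-loading part of step (1) could look roughly like the sketch below. It is only an illustration and leaves Nutch out so that it can run on its own. The class name RefetchDomainList, the one-domain-per-line format, and the '#' comment syntax are all assumptions of mine. Inside setConf you would presumably open the file named by conf.get("refetch.domains.file") (Hadoop's Configuration.getConfResourceAsReader looks like a reasonable way to do that) and store the returned set in refetchDomains.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: parses a refetch.domains.txt-style list, one domain per line. */
final class RefetchDomainList {

  private RefetchDomainList() {}

  /**
   * Reads one domain per line into a set of lowercase names.
   * Blank lines and lines starting with '#' are skipped
   * (this comment syntax is an assumption, not something from the thread).
   */
  static Set<String> load(Reader in) {
    Set<String> domains = new HashSet<>();
    try {
      BufferedReader reader = new BufferedReader(in);
      String line;
      while ((line = reader.readLine()) != null) {
        String domain = line.trim().toLowerCase(Locale.ROOT);
        if (domain.isEmpty() || domain.startsWith("#")) {
          continue; // nothing to record on this line
        }
        domains.add(domain);
      }
    } catch (IOException e) {
      // Rethrow unchecked so callers such as setConf need no throws clause.
      throw new UncheckedIOException(e);
    }
    return domains;
  }
}
```

A HashSet keeps the later contains() check in setFetchSchedule cheap, which matters because that method is called for every fetched URL.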

(2) Override the setFetchSchedule method, like this:

if (refetchDomains.contains(DomainUtil.getDomain(url))) {
  // shorten the fetch interval on datum so the page is refetched
  // more often, then return datum
}
return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
    fetchTime, modifiedTime, state);

I haven't tested this code, but I hope it works.
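To make the decision in step (2) a bit more concrete, here is a small Nutch-independent sketch of it. Everything in it is hypothetical: the class name RefetchIntervalChooser, the subdomain matching rule, and the idea of using one fixed short interval for all listed domains. In the real override the host would come from the URL (for example new java.net.URL(url.toString()).getHost()) and the chosen interval would then be applied to the CrawlDatum. That part is left out because it depends on the Nutch version.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical, Nutch-independent sketch of the per-domain interval decision. */
final class RefetchIntervalChooser {

  private final Set<String> refetchDomains;   // lowercase domains, e.g. "cnn.com"
  private final int frequentIntervalSeconds;  // short interval picked by the caller

  RefetchIntervalChooser(Set<String> refetchDomains, int frequentIntervalSeconds) {
    this.refetchDomains = refetchDomains;
    this.frequentIntervalSeconds = frequentIntervalSeconds;
  }

  /** True if the host equals a listed domain or is a subdomain of one. */
  boolean isRefetchHost(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    // Strip one leading label at a time:
    // www.edition.cnn.com -> edition.cnn.com -> cnn.com -> com
    while (true) {
      if (refetchDomains.contains(h)) {
        return true;
      }
      int dot = h.indexOf('.');
      if (dot < 0) {
        return false;
      }
      h = h.substring(dot + 1);
    }
  }

  /** Returns the short interval for listed domains, otherwise the interval passed in. */
  int chooseInterval(String host, int adaptiveIntervalSeconds) {
    return isRefetchHost(host) ? frequentIntervalSeconds : adaptiveIntervalSeconds;
  }
}
```

Matching on whole labels rather than a plain endsWith() avoids treating a host like notcnn.com as part of cnn.com.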

good luck

yanky

2009/3/4 Bartosz Gadzimski <[email protected]>

> Oh, I forgot. I didn't test that one, so I can't tell you how it works.
>
> I know that many people run generate, fetch, etc. loops very often
> to keep sites fresh.
>
> John Martyniak wrote:
>
>  Justin,
>>
>> Thanks for the info, this is very helpful.
>>
>> This value would apply to all pages though.  I was thinking that if you
>> have things like youtube.com, cnn.com, etc in your index you would
>> probably want them to be re-fetched more frequently.  So I was wondering if
>> there was some filter or other plugin in Nutch that would let you
>> specify that value.
>>
>> One option I was thinking of was creating a list of the URLs and then
>> importing it when creating the segment to be fetched.  But that would involve
>> a lot of outside processing.
>>
>> -John
>>
>> On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:
>>
>>  Hi John,
>>>
>>> You can find below parameters in conf/nutch-default.xml. You can change
>>> the value and put your own one in conf/nutch-site.xml
>>>
>>>
>>> <property>
>>>  <name>db.default.fetch.interval</name>
>>>  <value>30</value>
>>>  <description>(DEPRECATED) The default number of days between re-fetches
>>> of a page.
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>db.fetch.interval.default</name>
>>>  <value>2592000</value>
>>>  <description>The default number of seconds between re-fetches of a page
>>> (30 days).
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>db.fetch.interval.max</name>
>>>  <value>7776000</value>
>>>  <description>The maximum number of seconds between re-fetches of a page
>>>  (90 days). After this period every page in the db will be re-tried, no
>>>  matter what is its status.
>>>  </description>
>>> </property>
>>>
>>>
>>> Justin
>>>
>>> John Martyniak wrote:
>>>
>>>> How does Nutch determine when content needs to be re-fetched?  As I
>>>> understand it, it uses the "next fetch" date, which is 7 days in the
>>>> future.
>>>> Is there any way to change that?  Or to increase the fetching interval.
>>>>  Or somehow base it on how many times a piece of content is requested.
>>>> I would like to keep the content as fresh as possible, and the
>>>> information changes more frequently than every 7 days.
>>>> Thanks in advance,
>>>> -John
>>>>
>>>
>>
>>
>
