Yanky,

Thanks for the input, but I think I am going to try the straight AdaptiveFetchSchedule first. Your idea is a good one, though: it might be a nice extension to have, so that the user has more control over the adaptive schedule.

-John

On Mar 3, 2009, at 12:15 PM, yanky young wrote:

Hi:

If you want an adaptive fetching strategy only for specific domains, you can do
this:

Write your own subclass of AdaptiveFetchSchedule, for example MyAdaptiveFetchSchedule:

  class MyAdaptiveFetchSchedule extends AdaptiveFetchSchedule {

    void setConf(org.apache.hadoop.conf.Configuration conf)

    CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime, long fetchTime,
        long modifiedTime, int state)
  }

Sorry, I just copied the two method signatures from the API doc :-)

You have two things to do:

(1) Write a new configuration file, for example refetch.domains.txt, and add all
the domains you want to refetch frequently to it. Then add a new property to
nutch-site.xml, like below:

<property>
<name>refetch.domains.file</name>
<value>refetch.domains.txt</value>
</property>

Add a new field to MyAdaptiveFetchSchedule, for example refetchDomains, and add
code to the setConf method that gets the refetch.domains.file property and reads
the refetch.domains.txt file into refetchDomains.

(2) Override the setFetchSchedule method, like this:
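
if (refetchDomains.contains(DomainUtil.getDomain(url))) {
  // modify the schedule to refetch more frequently (a shorter interval) and return
}
return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
    fetchTime, modifiedTime, state);
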
I haven't tested the code; I hope it works.
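
The two steps above could be sketched roughly as follows. This is a standalone illustration that does not depend on the Nutch or Hadoop classes, so it can be run on its own. The class and method names (RefetchDomainSketch, loadDomains, isRefetchDomain, chooseInterval) are hypothetical, and the file format (one domain per line, '#' for comments) and the subdomain matching are assumptions, not something Nutch prescribes. In a real subclass, setConf would presumably look up the file name with conf.get("refetch.domains.file"), and setFetchSchedule would fall back to super.setFetchSchedule(...) for all other URLs.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Nutch. */
class RefetchDomainSketch {

  /** Reads one domain per line, skipping blank lines and '#' comments (assumed format). */
  static Set<String> loadDomains(Reader in) {
    Set<String> domains = new HashSet<>();
    new BufferedReader(in).lines()
        .map(line -> line.trim().toLowerCase(Locale.ROOT))
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .forEach(domains::add);
    return domains;
  }

  /** True if the URL's host is a listed domain or a subdomain of one. */
  static boolean isRefetchDomain(String url, Set<String> domains) {
    String host;
    try {
      host = URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return false; // malformed URL: treat as not listed
    }
    if (host == null) {
      return false;
    }
    host = host.toLowerCase(Locale.ROOT);
    // Walk up the host name: www.cnn.com -> cnn.com -> com
    while (true) {
      if (domains.contains(host)) {
        return true;
      }
      int dot = host.indexOf('.');
      if (dot < 0) {
        return false;
      }
      host = host.substring(dot + 1);
    }
  }

  /** Picks a fetch interval in seconds: the short one for listed domains, else the default. */
  static long chooseInterval(String url, Set<String> domains,
                             long defaultSeconds, long frequentSeconds) {
    return isRefetchDomain(url, domains) ? frequentSeconds : defaultSeconds;
  }
}
```

For example, with youtube.com and cnn.com in the list, chooseInterval("http://www.cnn.com/world", domains, 2592000L, 86400L) would pick the one-day interval, while a URL on any other host keeps the 30-day default. The one-day value is only an illustration.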

good luck

yanky

2009/3/4 Bartosz Gadzimski <[email protected]>

Oh, I forgot: I haven't tested that one, so I can't tell you how well it works.

I know that many people run the generate, fetch, etc. loop very often
to keep sites fresh.

John Martyniak wrote:

Justin,

Thanks for the info, this is very helpful.

This value would apply to all pages, though. I was thinking that if you
have things like youtube.com, cnn.com, etc. in your index, you would
probably want them to be re-fetched more frequently. So I was wondering
whether there is some filter or other plugin in nutch that will let you
specify that value per site.

One option I was thinking of was creating a list of the URLs and then
importing that when creating the segment to be fetched. But that would involve
a lot of outside processing.

-John

On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:

Hi John,

You can find the parameters below in conf/nutch-default.xml. You can change
the values and put your own in conf/nutch-site.xml.


<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches
of a page.
</description>
</property>

<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a page
(30 days).
</description>
</property>

<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a page
(90 days). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
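
Since the newer properties take seconds rather than days, a small conversion helps when picking your own value. The following is only a sketch: FetchIntervalSketch and its methods are hypothetical helper names, not part of Nutch.

```java
import java.time.Duration;

/** Hypothetical helper for converting re-fetch intervals; not part of Nutch. */
class FetchIntervalSketch {

  /** Converts whole days to the seconds value that db.fetch.interval.default expects. */
  static long daysToSeconds(long days) {
    return Duration.ofDays(days).getSeconds();
  }

  /** Converts a seconds value from the config back to whole days (rounds down). */
  static long secondsToDays(long seconds) {
    return Duration.ofSeconds(seconds).toDays();
  }
}
```

daysToSeconds(30) gives 2592000 and daysToSeconds(90) gives 7776000, matching the values above. A 7-day interval would be 604800 seconds, and a daily re-fetch 86400.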


Justin

John Martyniak wrote:

How does nutch determine when content needs to be re-fetched? The way that I understand it, it is based on the "next fetch" date, which is 7 days in the
future.
Is there any way to change that, or to increase the fetching frequency, or to somehow base it on how many times a piece of content is requested?
I would like to keep the content as fresh as possible, and the
information changes more frequently than every 7 days.
Thanks in advance,
-John





