Hi,
You can use the adaptive fetch schedule class, and in theory your index will stay very fresh.
Change org.apache.nutch.crawl.DefaultFetchSchedule to
org.apache.nutch.crawl.AdaptiveFetchSchedule
In nutch-default.xml you have a bunch of options for this class:
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule.
  DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>If a page is unmodified, its fetchInterval will be
  increased by this rate. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
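As a rough sketch of the change described above, the override would go in conf/nutch-site.xml rather than in nutch-default.xml itself. The file layout below assumes the usual Hadoop-style <configuration> root; the inc_rate value is simply the default quoted above, kept under 0.5 as the description advises, so treat it as a starting point and not a tuned setting.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: illustrative override, values are assumptions -->
<configuration>
  <!-- Switch from the default schedule to the adaptive one -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Rate by which an unmodified page's fetchInterval grows;
       should stay at or below 0.5 per the description above -->
  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
  </property>
</configuration>
```

With this in place, pages that do not change should be fetched less often over time, while the other adaptive options in nutch-default.xml keep their defaults unless you override them the same way.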
John Martyniak wrote:
Justin,
Thanks for the info, this is very helpful.
This value would apply to all pages though. I was thinking that if
you have things like youtube.com, cnn.com, etc in your index you would
probably want them to be re-fetched more frequently. So I was
wondering if there was some filter or other plugin, that is in nutch
that will let you specify that value.
One option I was thinking of was creating a list of the URLs and then
importing that when creating the segment to be fetched. But that would
involve a lot of outside processing.
-John
On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:
Hi John,
You can find the parameters below in conf/nutch-default.xml. You can
change the values and put your own in conf/nutch-site.xml:
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between
  re-fetches of a page.
  </description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a
  page (30 days).
  </description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>
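For example, a conf/nutch-site.xml that asks for a daily re-fetch instead of every 30 days could look roughly like this. The values 86400 (1 day) and 604800 (7 days) are only illustrative assumptions, not recommendations; pick whatever suits your content.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: illustrative override, values are assumptions -->
<configuration>
  <!-- Default interval between re-fetches, in seconds (here 1 day) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
  <!-- Upper bound on the interval, in seconds (here 7 days) -->
  <property>
    <name>db.fetch.interval.max</name>
    <value>604800</value>
  </property>
</configuration>
```

Properties set in nutch-site.xml take precedence over the ones in nutch-default.xml, so there is no need to edit the default file.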
Justin
John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched? The
way I understand it, it uses a "next fetch" date, which is 7
days in the future.
Is there any way to change that? Or to increase the fetch
frequency? Or somehow base it on how many times a piece of content
is requested?
I would like to keep the content as fresh as possible, and the
information changes more frequently than every 7 days.
Thanks in advance,
-John