Yanky,
Thanks for the input, but I think I am going to try the straight
AdaptiveFetchSchedule. Your suggestion is a good idea, though; it might
be a nice extension, so that the user has more control over the
adaptive schedule.
-John
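
For readers following along, here is a rough sketch of what the stock AdaptiveFetchSchedule does with a page's fetch interval, as far as I understand it: the interval shrinks when a page was found modified and grows when it was not, within a minimum and a maximum. The class below is stand-alone and uses no Nutch code. The rates and bounds are assumptions based on what I believe the db.fetch.schedule.adaptive.* properties default to in nutch-default.xml, so check your own copy. The schedule itself is selected with the db.fetch.schedule.class property.

```java
/** Stand-alone illustration of an adaptive re-fetch interval; not Nutch code. */
final class AdaptiveIntervalSketch {
    // Assumed defaults for db.fetch.schedule.adaptive.*; verify against nutch-default.xml.
    static final double INC_RATE = 0.4;                 // growth when the page did not change
    static final double DEC_RATE = 0.2;                 // shrinkage when the page changed
    static final double MIN_INTERVAL_SEC = 60.0;
    static final double MAX_INTERVAL_SEC = 31536000.0;  // 365 days

    /** Returns the next interval in seconds, clamped to [MIN, MAX]. */
    static double adjust(double intervalSec, boolean modified) {
        double next = modified
                ? intervalSec * (1.0 - DEC_RATE)
                : intervalSec * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }

    public static void main(String[] args) {
        double interval = 7 * 24 * 3600; // start from a 7-day interval
        for (int i = 1; i <= 5; i++) {
            interval = adjust(interval, true); // page changed on every fetch
            System.out.printf("after change #%d: %.1f days%n", i, interval / 86400.0);
        }
    }
}
```

With these assumed rates, a page that changes on every fetch has its interval cut by 20% each time, so frequently updated sites drift toward short intervals without any per-domain list.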
On Mar 3, 2009, at 12:15 PM, yanky young wrote:
Hi:
If you want an adaptive fetching strategy only for specific domains,
you can write your own subclass of AdaptiveFetchSchedule, say
MyAdaptiveFetchSchedule:
class MyAdaptiveFetchSchedule extends AdaptiveFetchSchedule {
  void setConf(org.apache.hadoop.conf.Configuration conf)
  CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime,
      long modifiedTime, int state)
}
(Javadoc: http://www.netlikon.de/docs/javadoc-nutch-trunk/org/apache/nutch/crawl/AdaptiveFetchSchedule.html)
Sorry, I just copied the two method signatures from the API doc :-)
You have two things to do:
(1) Write a new configuration file, say refetch.domains.txt, and add
all the domains you want to refetch frequently to it. Then add a new
property to nutch-site.xml, like below:
<property>
<name>refetch.domains.file</name>
<value>refetch.domains.txt</value>
</property>
Add a new field to MyAdaptiveFetchSchedule, say refetchDomains, and
add code to the setConf method that reads the refetch.domains.file
property and loads refetch.domains.txt into refetchDomains.
(2) Override the setFetchSchedule method, like this:
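// DomainUtil.getDomain is a placeholder for whatever helper extracts the domain
if (refetchDomains.contains(DomainUtil.getDomain(url.toString()))) {
  // set a shorter fetch interval on datum (i.e. refetch more often), then return it
  return datum;
}
return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
    fetchTime, modifiedTime, state);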
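
The two steps above can be sketched without any Nutch classes, which makes the domain check easy to try out before wiring it into a subclass. Everything here is hypothetical: the class name, the one-domain-per-line file format with '#' comments, and the one-hour interval are assumptions for illustration only. In a real MyAdaptiveFetchSchedule the parsing would happen in setConf, and the chosen interval would be applied to the CrawlDatum (for example through its fetch-interval and fetch-time setters) before returning it.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Stand-alone sketch of the per-domain decision; no Nutch classes are used. */
final class RefetchDomainSketch {
    // Hypothetical intervals in seconds; pick values that suit your crawl.
    static final long FREQUENT_INTERVAL_SEC = 3600L;    // 1 hour
    static final long DEFAULT_INTERVAL_SEC = 2592000L;  // 30 days

    /** Parses the lines of refetch.domains.txt: one domain per line, '#' starts a comment. */
    static Set<String> parseDomains(List<String> lines) {
        Set<String> domains = new HashSet<>();
        for (String line : lines) {
            String s = line.trim().toLowerCase(Locale.ROOT);
            if (!s.isEmpty() && !s.startsWith("#")) {
                domains.add(s);
            }
        }
        return domains;
    }

    /** True if the URL's host equals a listed domain or is a subdomain of one. */
    static boolean isListed(String url, Set<String> domains) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: treat as not listed
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Walk up the host name: www.cnn.com, then cnn.com, then com.
        while (true) {
            if (domains.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    /** Picks the short interval for listed domains and the default otherwise. */
    static long intervalFor(String url, Set<String> domains) {
        return isListed(url, domains) ? FREQUENT_INTERVAL_SEC : DEFAULT_INTERVAL_SEC;
    }

    public static void main(String[] args) {
        Set<String> domains = parseDomains(
                Arrays.asList("# frequently refetched", "youtube.com", "cnn.com"));
        System.out.println(intervalFor("http://www.cnn.com/world/", domains));
        System.out.println(intervalFor("http://example.org/", domains));
    }
}
```

Matching on suffixes of the host name means that listing cnn.com also covers www.cnn.com; if you only want exact hosts, a plain contains() on the full host is enough.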
I haven't tested the code; I hope it works.
Good luck,
yanky
2009/3/4 Bartosz Gadzimski <[email protected]>
Oh, I forgot: I didn't test that one, so I can't tell you how well it
works. I know that many people run generate, fetch, etc. loops very
often to keep sites fresh.
John Martyniak wrote:
Justin,
Thanks for the info; this is very helpful.
This value would apply to all pages, though. I was thinking that if
you have things like youtube.com, cnn.com, etc. in your index, you
would probably want them to be re-fetched more frequently. So I was
wondering whether there is some filter or other plugin in Nutch that
will let you specify that value for particular sites.
One option I was thinking of was creating a list of the URLs and then
importing it when creating the segment to be fetched. But that would
involve a lot of outside processing.
-John
On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:
Hi John,
You can find the parameters below in conf/nutch-default.xml. You can
change the values by putting your own in conf/nutch-site.xml.
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between
re-fetches of a page.
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a
page (30 days).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a
page (90 days). After this period every page in the db will be
re-tried, no matter what is its status.
</description>
</property>
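
Since the newer properties are expressed in seconds rather than days, a quick conversion helps when choosing your own values. The sketch below is stand-alone Java with hypothetical names; it only checks the arithmetic behind the values above and shows how a next-fetch timestamp follows from a fetch time in milliseconds plus an interval in seconds, which is my understanding of how the default schedule computes it.

```java
import java.util.concurrent.TimeUnit;

/** Stand-alone helper for fetch-interval arithmetic; not part of Nutch. */
final class FetchIntervalMath {
    /** Converts a number of days to the seconds value used by db.fetch.interval.* */
    static long daysToSeconds(long days) {
        return TimeUnit.DAYS.toSeconds(days);
    }

    /** Next fetch time in epoch milliseconds, given the last fetch time and an interval in seconds. */
    static long nextFetchMillis(long fetchTimeMillis, long intervalSeconds) {
        return fetchTimeMillis + TimeUnit.SECONDS.toMillis(intervalSeconds);
    }

    public static void main(String[] args) {
        System.out.println(daysToSeconds(30)); // value used for db.fetch.interval.default
        System.out.println(daysToSeconds(90)); // value used for db.fetch.interval.max
        System.out.println(daysToSeconds(1));  // a one-day interval, for comparison
    }
}
```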
Justin
John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched? The way
that I understand it, it uses the "next fetch" date, which is 7 days
in the future.
Is there any way to change that? Or to shorten the fetching interval?
Or somehow base it on how many times a piece of content is requested?
I would like to keep the content as fresh as possible, and the
information changes more frequently than every 7 days.
Thanks in advance,
-John