Thank you, that sounds like a close match to what I am looking for.

It looks like it is part of 1.0, but I am only using 0.9 at this time.

I think I read that an RC1 is coming out soon, so I might use that. In the meantime I will download a nightly build and play with it.

My site isn't in production yet, but we are building a large index (right now it is sitting at around 3.5 million URLs). Is Nutch trunk stable enough to use for the fetching and indexing?

-John


On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote:

Oh, I forgot: I didn't test that one, so I can't tell you how it works.

I know that many people run the generate, fetch, etc. loop very often to keep sites fresh.

John Martyniak wrote:
Justin,

Thanks for the info, this is very helpful.

This value would apply to all pages, though. I was thinking that if you have sites like youtube.com, cnn.com, etc. in your index, you would probably want them to be re-fetched more frequently. So I was wondering whether there is a filter or other plugin in Nutch that will let you specify that value per site.

One option I was thinking of was creating a list of the URLs and then importing it when creating the segment to be fetched. But that would involve a lot of outside processing.

-John

On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:

Hi John,

You can find the parameters below in conf/nutch-default.xml. You can change the values by putting your own in conf/nutch-site.xml.


<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches of a page.
</description>
</property>

<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>

<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no
matter what its status is.
</description>
</property>
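
As a rough sketch of such an override, conf/nutch-site.xml could contain something like the following. The value 86400 (one day, in seconds) is only an illustrative assumption, not a recommendation, and which property actually takes effect depends on your Nutch version: the deprecated db.default.fetch.interval above is expressed in days, not seconds.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative value only: 86400 seconds = 1 day between re-fetches. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
</configuration>
```

Values in nutch-site.xml take precedence over those in nutch-default.xml, so the defaults file itself can stay untouched.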


Justin

John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched? The way I understand it, there is a "next fetch" date which is 7 days in the future. Is there any way to change that, or to increase the fetching frequency, or somehow base it on how many times a piece of content is requested? I would like to keep the content as fresh as possible, and the information changes more frequently than every 7 days.
Thanks in advance,
-John



